Below is a review of the June 16th, 2009 version of this paper:
Shimon Edelman, On what it means to see, and what we can do about it, in Object Categorization: Computer and Human Vision Perspectives, S. Dickinson, A. Leonardis, B. Schiele, and M. J. Tarr, eds. (Cambridge University Press, 2009, in press). Penultimate draft.
I will refer to the article as OWMS (On What it Means to See).
The goal of Edelman's article is to demonstrate the limitations of conceptual vision (referred to as "seeing as"), criticize the modern computer vision paradigm as overly conceptual, and show that a richer representation of a scene is required for advancing computer vision.
Edelman proposes non-conceptual vision, where categorization isn't forced on an input -- "because the input may best be left altogether uninterpreted in the traditional sense." (OWMS) I have to agree with the author: abstracting the image away into a conceptual map not only gives an impoverished view of the world, but it is also unclear whether such a limited representation is useful for other tasks that rely on vision (see the bottom of Figure 1.2 in OWMS, or the figure below, taken from my Recognition by Association talk).
Building a Conceptual Map = Abstracting Away
Drawing on insights from the influential philosopher Wittgenstein, Edelman discusses the difference between "seeing" and "seeing as." "Seeing as" is the easy-to-formalize map-pixels-to-objects attitude which modern computer vision students are spoon-fed from the first day of graduate school -- and precisely the attitude which Edelman attacks in this wonderful article. To explain "seeing," Edelman uses some nice prose from Wittgenstein's Philosophical Investigations; however, instead of repeating the passages Edelman selected, I will complement the discussion with a relevant passage by William James:
The germinal question concerning things brought for the first time before consciousness is not the theoretic "What is that?" but the practical "Who goes there?" or rather, as Horwicz has admirably put it, "What is to be done?" ... In all our discussions about the intelligence of lower animals the only test we use is that of their acting as if for a purpose. (William James in Principles of Psychology, page 941)
"Seeing as" is a non-invertible process that abstracts away visual information to produce a lower dimensional conceptual map (see Figure above), whereas "seeing" provides a richer representation of the input scene. Its not exactly clear what is the best way to operationalize this "seeing" notion in a computer vision system, but the escapability-from-formalization might be one of the subtle points Edelman is trying to make about non-conceptual vision. Quoting Edelman, when "seeing" we are "letting the seething mass of categorization processes that in any purposive visual system vie for the privilege of interpreting the input be the representation of the scene, without allowing any one of them to gain the upper hand." (OWMS) Edelman goes on to criticize "seeing as" because vision systems have to be open-ended in the sense that we cannot specify ahead of time all the tasks that vision will be applied to. According to Edelman, conceptual vision cannot capture the ineffability (or richness) of the human visual experience. Linguistic concepts capture a mere subset of visual experience, and casting the goal of vision as providing a linguistic (or conceptual) interpretation is limited. The sparsity of conceptual understanding is one key limitation of the modern computer vision paradigm. Edelman also criticizes the notion of a "ground-truth" segmentation in computer vision, arguing that a fragmentation of the scene into useful chunks is in the eye of the beholder.
To summarize, Edelman points out that "The missing component is the capacity for having rich visual experiences... The visual world is always more complex than can be expressed in terms of a fixed set of concepts, most of which, moreover, only ever exist in the imagination of the beholder." (OWMS) As a pragmatist, I find that many of these words resonate deeply within my soul, and I'm particularly attracted to elements of Edelman's antirealism.
I have to give two thumbs up to this article for pointing out the flaws in the current way computer vision scientists go about tackling vision problems (in other words, researchers too often work blindly inside the current computer vision paradigm and do not question fundamental assumptions often enough, which is what helps new paradigms arise). I have already raised many similar concerns about Computer Vision on this blog, and it is reassuring to find others pointing to similar paradigmatic weaknesses. Such insights need to somehow leave the Philosophy/Psychology literature and make a long-lasting impact in the CVPR/NIPS/ICCV/ECCV/ICML communities. The problem is that too many researchers/hackers actually building vision systems and teaching Computer Vision courses have no clue who Wittgenstein is, or that they can gain invaluable insights from Philosophy and Psychology alike. Computer Vision is simply not lacking computational methods; what it lacks are critical insights that cannot be found inside an Emacs buffer. In order to advance the field, one needs to: read, write, philosophize, as well as mathematize, exercise, diversify, be a hacker, be a speaker, be one with the terminal, be one with prose, be a teacher, always a student, a master of all trades; or simply put, be a Computer Vision Jedi.