Tuesday, June 16, 2009

On Edelman's "On what it means to see"

I previously mentioned Shimon Edelman on this blog and explained why his ideas are important for the advancement of computer vision. Today I want to post a review of a powerful and potentially influential 2009 piece written by Edelman.

Below is a review of the June 16th, 2009 version of this paper:
Shimon Edelman, On what it means to see, and what we can do about it, in Object Categorization: Computer and Human Vision Perspectives, S. Dickinson, A. Leonardis, B. Schiele, and M. J. Tarr, eds. (Cambridge University Press, 2009, in press). Penultimate draft.

I will refer to the article as OWMS (On What it Means to See).

The goal of Edelman's article is to demonstrate the limitations of conceptual vision (referred to as "seeing as"), to criticize the modern computer vision paradigm as overly conceptual, and to argue that a richer representation of the scene is required for advancing computer vision.

Edelman proposes non-conceptual vision, where categorization isn't forced on an input -- "because the input may best be left altogether uninterpreted in the traditional sense." (OWMS) I have to agree with the author: abstracting the image away into a conceptual map is not only an impoverished view of the world, but it is also unclear whether such a limited representation is useful for other tasks relying on vision (something like the bottom of Figure 1.2 in OWMS, or the figure below, taken from my Recognition by Association talk).


Building a Conceptual Map = Abstracting Away
Drawing on insights from the influential philosopher Ludwig Wittgenstein, Edelman discusses the difference between "seeing" and "seeing as." "Seeing as" is the easy-to-formalize, map-pixels-to-objects attitude that modern computer vision students are spoon-fed from the first day of graduate school -- and precisely the attitude Edelman attacks in this wonderful article. To explain "seeing," Edelman uses some nice prose from Wittgenstein's Philosophical Investigations; however, instead of repeating the passages Edelman selected, I will complement the discussion with a relevant passage by William James:

The germinal question concerning things brought for the first time before consciousness is not the theoretic "What is that?" but the practical "Who goes there?" or rather, as Horwicz has admirably put it, "What is to be done?" ... In all our discussions about the intelligence of lower animals the only test we use is that of their acting as if for a purpose. (William James in Principles of Psychology, page 941)

"Seeing as" is a non-invertible process that abstracts away visual information to produce a lower-dimensional conceptual map (see Figure above), whereas "seeing" provides a richer representation of the input scene. It's not exactly clear what the best way is to operationalize this notion of "seeing" in a computer vision system, but this very escapability from formalization might be one of the subtle points Edelman is trying to make about non-conceptual vision. Quoting Edelman, when "seeing" we are "letting the seething mass of categorization processes that in any purposive visual system vie for the privilege of interpreting the input be the representation of the scene, without allowing any one of them to gain the upper hand." (OWMS) Edelman goes on to criticize "seeing as" because vision systems have to be open-ended in the sense that we cannot specify ahead of time all the tasks that vision will be applied to. According to Edelman, conceptual vision cannot capture the ineffability (or richness) of the human visual experience. Linguistic concepts capture a mere subset of visual experience, and casting the goal of vision as providing a linguistic (or conceptual) interpretation is limited. The sparsity of conceptual understanding is one key limitation of the modern computer vision paradigm. Edelman also criticizes the notion of a "ground-truth" segmentation in computer vision, arguing that a fragmentation of the scene into useful chunks is in the eye of the beholder.
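To make the contrast concrete, here is a toy sketch (my own illustration, not code from the article; the labels and scores are made up): "seeing as" collapses the evidence into a single winning concept, while one possible operationalization of "seeing" keeps the whole distribution of competing interpretations as the representation.

```python
import math

# Hypothetical evidence that three competing categorization
# processes assign to the same image region (made-up numbers).
labels = ["tree", "forest", "bush"]
scores = [2.1, 1.9, 0.3]

# "Seeing as": a non-invertible abstraction -- collapse the
# evidence into a single conceptual label.
seeing_as = labels[scores.index(max(scores))]

# "Seeing" (one possible operationalization): keep the full
# softmax distribution over interpretations, without letting
# any single concept "gain the upper hand."
z = sum(math.exp(s) for s in scores)
seeing = {label: math.exp(s) / z for label, s in zip(labels, scores)}

print(seeing_as)  # a single concept: tree
print(seeing)     # every competing interpretation survives
```

The point of the sketch is only that the argmax step throws away information that can never be recovered, whereas the distribution remains usable by whatever downstream task comes along.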

To summarize, Edelman points out that "The missing component is the capacity for having rich visual experiences... The visual world is always more complex than can be expressed in terms of a fixed set of concepts, most of which, moreover, only ever exist in the imagination of the beholder." (OWMS) As a pragmatist, I find that many of these words resonate deeply with me, and I'm particularly attracted to the elements of Edelman's antirealism.

I have to give two thumbs up to this article for pointing out the flaws in how computer vision scientists currently go about tackling vision problems: researchers too often work blindly inside the current computer vision paradigm and do not question often enough the fundamental assumptions whose reexamination could help new paradigms arise. I have already raised many similar concerns about Computer Vision on this blog, and it is reassuring to find others pointing to the same paradigmatic weaknesses. Such insights need to somehow leave the Philosophy/Psychology literature and make a long-lasting impact on the CVPR/NIPS/ICCV/ECCV/ICML communities. The problem is that too many of the researchers/hackers actually building vision systems and teaching Computer Vision courses have no clue who Wittgenstein is, nor that they can gain invaluable insights from Philosophy and Psychology alike. Computer Vision is simply not lacking computational methods; what it lacks are critical insights that cannot be found inside an Emacs buffer. In order to advance the field, one needs to read, write, philosophize, as well as mathematize; exercise and diversify; be a hacker, be a speaker, be one with the terminal, be one with prose; be a teacher and always a student; a master of all trades -- or, simply put, a Computer Vision Jedi.

3 comments:

  1. Anonymous, 4:29 PM

    I partly disagree with the criticism of ground-truth segmentations. While I agree for datasets like Berkeley, where the criticism imho applies, in general it's unwarranted. For example, a segment that extracts the full body of a person is quite objectively the right one: all of its features will be isolated from the background and grouped together, ready to compare with other exemplars for, say, pose prediction. There is no other segment that is better. Your own BMVC paper shows the improvement of having a full object segment vs. a rectangular window.

    About the psycho/philosophy part, I've read quite a bit of that, and I agree that it's good to look outside the current zeitgeist. But in the end philosophers are mostly contemplative people; hacking code gives you insights that these people can't get. It's plausible that vision is just like an airplane: it might not need to be like a bird.

  2. On the segmentation part of your comment, I would say that if you want to extract the pose of a person, then it definitely makes sense to talk about a "good segmentation" of a person. The problem is that if you look at a natural scene, there are many different ways to group the scene into coherent segments. Consider an image of a forest: one can group all of the vegetation into a single segment and call it "forest," or one can segment out individual trees and call each of them "tree." To make things worse, should a "tree" segment contain the trunk as well as the leaves, or are "trunk" and "leaves" two different objects? How about a root sticking out of the ground -- is that part of the tree segment or not? I think Edelman's criticism is that there is no God's Eye "Ground Truth" independent of the observer -- how we segment the world into concepts is different from how rats segment the world.

    For the psycho/philo part, I want to suggest that progress will be made by researchers who blend ideas from many different domains. I agree that hacking produces insights inaccessible to scholars stuck reading Philosophy, but people reading only Machine Learning/Computer Vision papers are also at a disadvantage.

    But I agree 100% that vision is just like an airplane: it might not need to be like a bird!

  3. Mosalam, 5:33 AM

    Just my 2 cents: in CV, algorithms are mainly application-specific (e.g., pose prediction), and your algorithms can be inspired by psychological, philosophical, or biological ideas/theories. One can argue neuroscience is an even better option.

    Now there are face detectors that can find all the faces in an image even faster than a human, with the same accuracy. But those algorithms wouldn't distinguish drawn faces from photos. They're just not intended to do so. There is no need to waste computation on something that isn't necessary.
