Sunday, June 13, 2010

This morning, Sven Dickinson gave a talk to open the POCV 2010 Workshop at CVPR 2010. For those of you who might not know, POCV stands for Perceptual Organization in Computer Vision. While segmentation can be thought of as a perceptual grouping process, contiguous regions don't have to be the end product of meaningful perceptual grouping. There are many popular and useful algorithms which group non-accidental contours yet fall short of a full-blown image segmentation.
The title of Dickinson's talk was "The Role of Intermediate Shape Priors in Perceptual Grouping and Image Abstraction." At the beginning of his talk, Sven pointed out how perceptual organization was in its prime in the mid 90s and declined in the 2000s due to the popularity of machine learning and the "detection" task. He believes that good perceptual grouping is what is going to make vision scale -- that is, without first squeezing all that we can out of the bottom level, we are doomed to fail.
Dickinson showed some nice results from his most recent research efforts where objects are broken down into generic "parts" -- this reminded me of Biederman's geons, although Sven's fitting is done in the 2D image plane. Sven emphasized that successful shape primitives must be category-independent if we are to have scalable recognition of thousands of visual concepts in images. This is quite different from the mainstream per-category object detection task which has been popularized by contests such as the PASCAL VOC.
While I personally believe that there is a good place for perceptual organization in vision, I wouldn't view it as the Holy Grail. It is perhaps the Holy Bridge we must inevitably cross on the way to finding the Holy Grail. I believe that for full-grown fully-functional members of society, our ability to effortlessly cope with the world is chiefly due to its simplicity and repeatability, and not due to some amazing internal perceptual organization algorithm. Perhaps it is when we were children -- viewing the world through a psychedelic fog of innocence -- that perceptual grouping helped us cut up the world into meaningful entities.
A common theme in Sven's talk was the idea of Learning to Group in a category-independent way. This means that all of the successes of Machine Learning aren't thrown out the door, and this appears to be a quite different way of grouping from what was done in the 1970s.
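To make "learning to group" a bit more concrete, here is a minimal sketch of what one category-independent grouper could look like -- my own illustration, not Sven's method. A classifier is trained on generic appearance cues for pairs of adjacent regions (the features, numbers, and threshold below are all invented) and is never shown a category label:

```python
# Hedged sketch of category-independent learning-to-group: predict whether
# two adjacent regions belong together from generic cues alone.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [color difference, texture difference] for
# adjacent region pairs, labeled 1 if the pair came from the same object.
pair_features = np.array([[0.05, 0.10],   # similar color, similar texture
                          [0.90, 0.80],   # very different on both cues
                          [0.10, 0.20],
                          [0.70, 0.95]])
same_object = np.array([1, 0, 1, 0])

affinity = LogisticRegression().fit(pair_features, same_object)

# Merge two new adjacent regions if the learned affinity is high enough.
new_pair = np.array([[0.08, 0.15]])
merge = affinity.predict_proba(new_pair)[0, 1] > 0.5
print("merge regions:", merge)
```

Notice that nothing in the sketch mentions cars or chairs -- the grouper could in principle be trained once and reused across thousands of categories.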
Tomorrow I will be at the ACVHL Workshop "Advancing Computer Vision with Humans in the Loop". I haven't personally "turked" yet, but I feel I will be jumping on the bandwagon soon. Anyway, the keynote speakers should make for an awesome workshop. They do not need introductions: David Forsyth, Aude Oliva, Fei-Fei Li, Antonio Torralba, and Serge Belongie -- all influential visionaries.
Monday, January 18, 2010
Understanding versus Interpretation -- a philosophical distinction
Today I want to bring up an interesting discussion regarding the connotation of the word "understanding" versus "interpretation," particularly in the context of "scene understanding" versus "scene interpretation." While many vision researchers use these terms interchangeably, I think it is worthwhile to make the distinction, albeit a philosophical one.
On Understanding
While everybody knows that the goal of computer vision is to recognize all of the objects in an image, there is plenty of disagreement about how to represent objects and recognize them in the image. There is a physicalist account (from Wikipedia: Physicalism is a philosophical position holding that everything which exists is no more extensive than its physical properties), where the goal of vision is to reconstruct veridical properties of the world. This view is consistent with the realist stance in philosophy (think back to Philosophy 101) -- there exists a single observer-independent 'ground-truth' regarding the identities of all of the objects contained in the world. The notion of vision as measurement is very strong under this physicalist account. The stuff of the world is out there just waiting to be grasped! I think the term "understanding" fits very well into this truth-driven account of computer vision.
On Interpretation
The second view, a postmodern and anti-realist one, is of vision as a way of interpreting scenes. The shift is from veridical recovery of the properties of the world from an image (measurement) to the observer-dependent interpretation of the input stimulus. Under this account, there is no need to believe in a god's eye 'objective' view of the world. Image interpretation is the registration of an input image with a vast network of past experience, both visual and abstract. The same person can vary their own interpretation of an input as time passes and their internal knowledge base evolves. Under this view, two distinct robots could provide very useful yet distinct 'image interpretations' of the same input image. The main idea is that different robots could have different interpretation-spaces, that is, they could obtain incommensurable (yet very useful!) interpretations of the same image.
It has been argued by Donald Hoffman (Interface Theory of Perception) that there is no reason why we should expect evolution to have driven humans towards veridical perception. In fact, Hoffman argues that nature drives veridical perception towards extinction and that it only makes sense to speak of perception as guiding agents towards pragmatic interpretations of their environment.
In philosophy of science, there is the debate of whether the field of physics is unraveling some ultimate truth about the world versus physics painting a coherent and pragmatic picture of the world. I've always viewed science as an art and I embrace my anti-realist stance -- which has been shaped by Thomas Kuhn, William James, and many others. While my scientific interests have currently congealed in computer vision, it is no surprise that I'm finding conceptual agreement between my philosophy of science and my concrete research efforts in object recognition.
Monday, March 30, 2009
Time Travel, Perception, and Mind-wandering
Today's post is dedicated to ideas promulgated by Bar's most recent article, "The proactive brain: memory for predictions."
Bar builds on the foundation of his former thesis, namely that the brain's 'default' mode of operation is to daydream, fantasize, and continuously revisit and reshape past memories and experiences. While it makes sense that traversing the internal network of past experiences is useful when trying to understand a complex novel phenomenon, why exert so much effort when just 'chilling out', a.k.a. being in the 'default' mode? Bar's proposal is that this seemingly wasteful daydreaming is actually crucial for generating virtual experiences and synthesizing not-directly-experienced, yet critically useful memories of alternate scenarios. These 'alternate future memories' are how our brain recombines tidbits from actual experiences and helps us understand novel scenarios before they actually happen. It makes sense that the brain has a method for 'densifying' the network of past experiences, but that this happens in the 'default' mode is a truly bold view held by Bar.
In the domain of visual perception and scene understanding, the world has much regularity. Thus the predictions generated by our brain often match the percept, and accurate predictions rid us of the need to exert mental brainpower on certain predictable aspects of the world. For example, seeing a bunch of cars on a road along with a bunch of windows on a building pre-sensitizes us so much with respect to seeing a stop sign in an intimate spatial relationship with the other objects that we don't need to perceive much more than a speckle of red for a nanosecond to confirm its presence in the scene.
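One way to make this priming story concrete is to treat context as a prior in a toy Bayesian update -- this is my own illustration, not Bar's model, and every number below is invented:

```python
# Toy Bayesian sketch: context raises the prior on "stop sign", so even a
# weak glimpse of red pushes the posterior near certainty.

def posterior(prior, p_red_given_sign, p_red_given_no_sign=0.05):
    """P(stop sign | red speckle) via Bayes' rule for a binary hypothesis."""
    evidence = p_red_given_sign * prior + p_red_given_no_sign * (1.0 - prior)
    return p_red_given_sign * prior / evidence

weak_red_evidence = 0.30      # a nanosecond speckle, not a clear view
prior_no_context = 0.01       # stop signs are rare in arbitrary images
prior_street_context = 0.60   # cars + road + windows prime the intersection

print(posterior(prior_no_context, weak_red_evidence))      # ~0.06: unconvincing
print(posterior(prior_street_context, weak_red_evidence))  # ~0.90: confirmed
```

The same weak evidence that is useless in a vacuum becomes nearly conclusive once the context has done its work -- which is roughly what "pre-sensitized" means above.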
Quoting Bar, "we are rarely in the 'now'" since when understanding the visual world we integrate information from multiple points in time. We use the information perceptible to our senses (the now), memories of former experiences (the past), as well as all of the recombined and synthesized scenarios explored by our brains and encoded as virtual memories (plausible futures). In each moment of our waking life, our brains provide us with a shortlist of primed (to-be-expected) objects, contexts, and their configurations related to our immediate perceptible future. Who says we can't travel through time? -- it seems we are already living a few seconds ahead of direct perception (the immediate now).
Thursday, March 26, 2009
Beyond Categorization: Getting Away From Object Categories in Computer Vision
Natural language evolved over thousands of years to become the powerful tool that it is today. When we say things using language to convey our experiences with the world, we can't help but refer to object categories. When we say things such as "this is a car" what we are actually saying is "this is an instance from the car category." Categories let us get away from referring to individual object instances -- in most cases knowing that something belongs to a particular category is more than enough knowledge to deal with it. This is a type of "understanding by compression," or understanding by abstracting away the unnecessary details. In the words of Rosch, "the task of category systems is to provide maximum information with the least cognitive effort." Rosch would probably agree that it only makes sense to talk about the utility of a category system (as a tool for getting a grip on reality) as opposed to the truth value of a category system with respect to how well it aligns with observer-independent reality. The degree of pragmatism expressed by Rosch is something that William James would have been proud of.
From a very young age we are taught language and soon it takes over our inner world. We 'think' in language. Language provides us with a list of nouns -- a way of cutting up the world into categories. Different cultures have different languages that cut up the world differently, and one might wonder how well the object categories contained in any single language correspond to reality -- if it even makes sense to talk about an observer-independent reality. Rosch would argue that human categorization is the result of "psychological principles of categorization" and is more related to how we interact with the world than to how the world is. If the only substances we ingested for nutrients were types of grass, then categorizing all of the different strains of grass with respect to flavor, vitamin content, color, etc. would be beneficial for us (as a species). Rosch points out in her works that her ideas refer to categorization at the species level, and she calls it human categorization. She is not referring to a personal categorization; for example, the way a child might cluster concepts when he/she starts learning about the world.
It is not at all clear to me whether we should be using the categories from natural language as the to-be-recognized entities in our image understanding systems. Many animals do not have a language with which they can compress percepts into neat little tokens -- yet they have no problem interacting with the world. Of course, if we want to build machines that understand the world around them in a way that they can communicate with us (humans), then language and its inherent categorization will play a crucial role.
While we ultimately use language to convey our ideas to other humans, how early are the principles of categorization applied to perception? Is the grouping of percepts into categories even essential for perception? I doubt that anybody would argue that language and its inherent categorization is not useful for dealing with the world -- the only question is how it interacts with perception.
Most computer vision researchers are stuck in the world of categorization, and many systems rely on categorization at a very early stage. A problem with categorization is its inability to deal with novel categories -- something which humans must deal with at a very young age. We (humans) can often deal with arbitrary input and, using analogies, can still get a grip on the world around us (even when it is full of novel categories). One hypothesis is that at the level of visual perception things are not recognized into discrete object classes but into a continuous recognition space. Thus instead of asking the question "What is this?" we focus on similarity measurements and ask "What is this like?". Such a comparison-based view would help us cope with novel concepts.
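"What is this like?" is easy to sketch in code. Here is a minimal, hypothetical exemplar memory -- my own illustration of the comparison-based view, not any particular published system -- that answers a query with its nearest remembered instances instead of a category label:

```python
# Sketch of comparison-based recognition: rank remembered exemplars by
# similarity to the query percept; no category label is ever assigned.
import numpy as np

class ExemplarMemory:
    def __init__(self):
        self.features = []  # one feature vector per remembered exemplar
        self.names = []     # a human-readable tag per exemplar, not a category

    def remember(self, feature, name):
        self.features.append(np.asarray(feature, dtype=float))
        self.names.append(name)

    def what_is_this_like(self, query, k=3):
        """Return the k nearest exemplars under Euclidean distance."""
        dists = [np.linalg.norm(f - query) for f in self.features]
        order = np.argsort(dists)[:k]
        return [(self.names[i], dists[i]) for i in order]

memory = ExemplarMemory()
memory.remember([0.9, 0.1, 0.3], "my neighbor's sedan")
memory.remember([0.8, 0.2, 0.4], "a delivery van I saw yesterday")
memory.remember([0.1, 0.9, 0.7], "a park bench")

# A novel object is described by its nearest exemplars, no category needed.
print(memory.what_is_this_like(np.array([0.85, 0.15, 0.35])))
```

The feature vectors and exemplar names are made up, but the point survives: a novel object gets a useful description ("like the sedan, a bit like the van") even when no category in the vocabulary fits it.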
Thursday, February 12, 2009
Context is the 'glue' that binds objects in coherent scenes.
This is not my own quote. It is one of my favorites from Moshe Bar. It comes from his paper "Visual Objects in Context."
I have been recently giving context (in the domain of scene understanding) some heavy thought.
While Bar's paper is good, the one I wanted to focus on goes back to the 1980s. According to Bar, the following paper (which I wish everybody would at least skim) is "a seminal study that characterizes the rules that govern a scene's structure and their influence on perception."
Biederman, I., Mezzanotte, R. J. & Rabinowitz, J. C. Scene perception: detecting and judging objects undergoing relational violations. Cogn. Psychol. 14, 143–177 (1982).
Biederman outlines 5 object relations. The three semantic (related to object categories) relations are probability, position, and familiar size. The two syntactic (not operating at the object category level) relations are interposition and support. According to Biederman, "these relations might constitute a sufficient set with which to characterize the organizations of a real-world scene as distinct from a display of unrelated objects." The world has structure and characterizing this structure in terms of such rules is quite a noble effort.
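To make the taxonomy concrete, here is a toy encoding of some of these relations as checks over a hand-coded symbolic scene description -- my own sketch, not Biederman's formulation, and a real system would of course have to compute these relations from pixels:

```python
# Toy relational-violation checker over a hypothetical symbolic scene format;
# position and interposition would require pairwise geometric checks and are
# omitted from this sketch.
SEMANTIC_RELATIONS = {"probability", "position", "familiar_size"}
SYNTACTIC_RELATIONS = {"interposition", "support"}

def relational_violations(scene):
    """scene: a list of dicts like {"object": "sofa", "supported": False,
    "plausible_here": False} -- a hand-coded, invented format."""
    violations = []
    for obj in scene:
        if not obj.get("supported", True):       # syntactic: a floating object
            violations.append((obj["object"], "support"))
        if not obj.get("plausible_here", True):  # semantic: a sofa in the street
            violations.append((obj["object"], "probability"))
        if not obj.get("plausible_size", True):  # semantic: a giant coffee cup
            violations.append((obj["object"], "familiar_size"))
    return violations

street = [{"object": "car", "supported": True},
          {"object": "sofa", "supported": False, "plausible_here": False}]
print(relational_violations(street))  # [('sofa', 'support'), ('sofa', 'probability')]
```

Biederman's stimuli were essentially scenes with such violations deliberately inserted, so that their effect on detection and judgment could be measured.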
A very interesting question that Biederman addresses is the following: do humans reason about syntactic relations before semantic relations, or the other way around? On a Gibsonian (direct perception) view of the world, the processing of depth and space precedes the assignment of identity to the stuff that occupies the space around us. J.J. Gibson's view is in accordance with Marr's pipeline.
However, Biederman's study with human subjects (he is a psychologist) suggests that information about semantic relationships among objects is not accessed after semantic-free physical relationships. Quoting him directly, "Instead of a 3D parse being the initial step, the pattern recognition of the contours and the access to semantic relations appear to be the primary stages" as well as "further evidence that an object's semantic relations to other objects are processed simultaneously with its own identification."
Now that I've whetted your appetite, let's bring out the glue.
P.S. Moshe Bar was a student of I. Biederman and S. Ullman (Against Direct Perception author).
Tuesday, January 13, 2009
Computer Vision Courses, Measurement, and Perception
The new semester began at CMU and I'm happy to announce that I'm TAing my advisor's 16-721 Learning Based Methods in Vision this semester. I'm also auditing Martial Hebert's Geometry Based Methods in Vision.
This semester we're trying to encourage students of 16-721 LBMV09 to discuss papers using a course discussion blog. Quicktopic has been used in the past, but this semester we're using Google's Blogger.com for the discussion!
In the first lecture of LBMV, we discussed the problem of Measurement versus Perception in a Computer Vision context. The idea is that while we could build vision systems to measure the external world, it is percepts such as "there is a car on the bottom of the image," and not measurements such as "the bottom of the image is gray," that we are ultimately interested in. However, the line between measurement and perception is somewhat blurry. Consider the following gedanken experiment: place a human in a box and feed him an image and the question "is there a car on the bottom of the image?". Is it legitimate to call this apparatus a measurement device? If so, then isn't perception a type of measurement? We would still have the problem of building a second version of this measurement device -- different people have different notions of cars, and when we start feeding two apparatuses examples of objects that are very close to trucks/buses/vans/cars, we would lose measurement repeatability.
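The repeatability problem is easy to quantify. Here is a tiny sketch (with invented labels) in which two "human measurement devices" answer the car question, agreeing on the clear cases but diverging on the borderline vehicles:

```python
# Two observers answer "is there a car at the bottom of the image?" on six
# hypothetical images; their raw agreement rate is the repeatability of the
# perception-as-measurement apparatus.
images = ["sedan", "pickup truck", "minivan", "bus", "empty road", "SUV"]
device_a = [1, 1, 1, 0, 0, 1]  # a liberal notion of "car"
device_b = [1, 0, 0, 0, 0, 1]  # a stricter notion of "car"

agreement = sum(a == b for a, b in zip(device_a, device_b)) / len(images)
print(f"repeatability: {agreement:.2f}")  # 0.67 -- far from a physical instrument
```

A voltmeter with 67% repeatability would be thrown out of the lab, which is exactly why treating perception as measurement is so uncomfortable.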
This whole notion of measurement versus perception in computer vision is awfully similar to the theory and observation problem in philosophy of science. Thomas Kuhn would say that the window through which we peer (our scientific paradigm) circumscribes the world we see, and thus it is not possible to make theory-independent observations. For a long time I have been a proponent of this postmodern view of the world. The big question that remains is: for computer vision to be successful, how much consensus must there be between human perception and machine perception? If, according to Kuhn, Aristotelian and Galilean physicists would have different "observations" of an experiment, then should we expect intelligent machines to see the same world that we see?
Tuesday, November 04, 2008
Computer Vision as immature?
Ashutosh Saxena points out on his web page an interesting quote from Wikipedia about Computer Vision from August 2007. I just checked out the Wikipedia article on Computer Vision, and it seems this paragraph is still there. Parts of it go as follows:
The field of computer vision can be characterized as immature and diverse ... Consequently there is no standard formulation of "the computer vision problem." ... no standard formulation of how computer vision problems should be solved.
I agree that there is no elegant equation akin to F=ma or Schrödinger's wave equation that magically explains how meaning is to be attributed to images. While this might seem like a weak point, especially to the mathematically inclined always seeking to generalize and abstract away, I am skeptical of Computer Vision ever being grounded in such an all-encompassing mathematical theory.
Being a discipline centered on perception and reasoning, there is something about Computer Vision that will make it forever escape formalization. State-of-the-art computer vision systems that operate on images can return many different types of information. Some systems return bounding boxes of all object instances from a single category, some systems break up the image into regions (segmentation) and say nothing about object classes/categories, and other systems assign a single object-level category to the entire image without performing any localization/segmentation. Aside from objects, some systems (see Hoiem et al. and Saxena et al.) return a geometric 3D layout of the scene. While it seems that humans can do extremely well at all these tasks, it makes sense that different robotic agents interacting with the real world should perceive the world differently to accomplish their own varying tasks. Think of biological vision -- do we see the same world as dogs? Is there an objective observer-independent reality that we are supposed to see? To me, perception is very personal, and while my hardware (brain) might appear similar to another human's, I'm not convinced that we see/perceive/understand the world the same way.
I can imagine, ~40 years ago, researchers trying to come up with an abstract theory of computation that would allow one to run arbitrary computer programs. What we have today is a myriad of operating systems and programming languages suited for different crowds and different applications. While the humanoid robot in our living room is nowhere to be found, I believe that if we wait until that day and inspect its internal workings, we will not see a beautiful rigorous mathematical theory. We will see AI/mechanical components developed by different research groups and integrated by other researchers -- the fruits of a long engineering effort. These bots will be always learning, always getting updates, and always being replaced by newer and better ones.