Monday, August 23, 2010

Beyond pixel-wise labeling: Blocks World Revisited

"Thoughts without content are empty, intuitions without concepts are blind." -- Immanuel Kant 

The Holy Grail problem of computer vision research is general-purpose image understanding.  Given as input a digital image (perhaps from Flickr or from Google Image search), we want to recognize the depicted objects (cars, dogs, sheep, Macbook Pros), their functional properties (which of the depicted objects are suitable for sitting), and recover the underlying geometry and spatial relations (which objects are lying on the desk). 

The early days of vision were dominated via the "Image Understanding as Inverse Optics" mentality.  In order to make the problem easier, as well as to cope with the meager computational resources of the 60s, early computer vision researchers tried to recover the 3D geometry of simple scenes consisting of arrangements of blocks.  One of the earlier efforts in this direction, is the PhD thesis Machine Perception of Three-Dimensional Solids by Larry Roberts from MIT back in 1963.

But wait -- these block-worlds are unlike anything found in the real world!  The drastic divide between the imagery that vision researchers were studying in the 60s and what humans observe during their daily experiences ultimately led to the disappearance of block-worlds in computer vision research.

Image Parsing Concept Image from Computer Blindness Blog

Over the past couple of decades, we have seen the success of Machine Learning, and it is of no surprise that we are currently living in the "Image Understanding as statistical inference" era.  While a single 256x256 grayscale image might have been okay to use in the 1960s, today's computer vision researchers use powerful computer clusters and do serious heavy-lifting on millions of real-world megapixel images.  The man-made blocks-world of the 1960s is a thing of the past, and the variety found on random images downloaded from Flickr is the complexity we must now cope with.

While the style of computer vision research has shifted since its early days in the 1960s/1970s,  many old ideas (and perhaps prematurely considered outdated) are making a comeback!

Assigning basic-level object category labels to pixels is a very popular theme in vision.  Unfortunately, to gain a deeper understanding of an image, robots will inevitably have to go beyond pixel-level class labels.  (This is one of the central themes in my thesis -- coming out soon!)  Given human-level understanding of a scene, it is trivial to represent it as a pixel-wise labeling map, but given a pixel-wise labeling map it is not trivial to convert it to human-level understanding. 

What sort of questions can be answered about a scene when the output of an "image understanding" system is represented as a pixel-wise label map?

1. Is there a car in the image?
2. Is there a person at this location in the image?

What questions cannot be answered given a pixel-wise label map?

1. How many cars are in this image? (While there are some approaches that strive to deal with delineating object instance boundaries, most image parsing approaches fail to recognize boundaries between two instances of the same category)
2. Which surfaces can I sit on?
3. Where can I park my car?
4. How geometrically stable are the objects in the scene?

While I have more criticisms than tentative solutions, I believe that vision students shouldn't be parochially preoccupied with solely the most recent approach to image understanding.  It is valuable to go back several decades in the literature and gain a broader perspective on image understanding.  However, some progress is being made!  A deeply insightful upcoming paper from ECCV 2010, is the following:

Abhinav Gupta, Alexei A. Efros and Martial Hebert, Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics, European Conference on Computer Vision, 2010. (PDF)

What Abhinav Gupta does very elegantly in this paper is connect the blocks-world research of the 1960s with the geometric-class estimation problem, as introduced by Derek Hoiem.  While the final system is evaluation in a Hoiem-like pixel-wise labeling task, the actual scene representation is 3D.  The blocks in this approach are more abstract than the Lego-like volumes in the 1960s -- Abhinav's blocks are actually cars, buildings, and trees. I included the infamous Immanuel Kant quote, because I feel it describes Abhinav's work very well.  Abhinav introduces the block as a theoretical construct which glues together a scene's elements and provides a much more solid interpretation -- Abhinav's blocks add the content to geometric image understanding which is lacking in the purely pixe-wise approaches.

While integrating large-scale categorization into this type of geometric reasoning is still an open problem, Abhinav provides us visionaries with a glimpse of what image understanding should be.  The integration of robotics with image understanding technology will surely drive pixel-based "dumb" image understanding approaches to extinction.