Monday, August 23, 2010

Beyond pixel-wise labeling: Blocks World Revisited

"Thoughts without content are empty, intuitions without concepts are blind." -- Immanuel Kant 

The Holy Grail problem of computer vision research is general-purpose image understanding.  Given as input a digital image (perhaps from Flickr or from Google Image search), we want to recognize the depicted objects (cars, dogs, sheep, Macbook Pros), their functional properties (which of the depicted objects are suitable for sitting), and recover the underlying geometry and spatial relations (which objects are lying on the desk). 

The early days of vision were dominated by the "Image Understanding as Inverse Optics" mentality.  In order to make the problem easier, as well as to cope with the meager computational resources of the 60s, early computer vision researchers tried to recover the 3D geometry of simple scenes consisting of arrangements of blocks.  One of the earlier efforts in this direction is the PhD thesis Machine Perception of Three-Dimensional Solids by Larry Roberts from MIT back in 1963.

But wait -- these block-worlds are unlike anything found in the real world!  The drastic divide between the imagery that vision researchers were studying in the 60s and what humans observe during their daily experiences ultimately led to the disappearance of block-worlds in computer vision research.

Image Parsing Concept Image from Computer Blindness Blog

Over the past couple of decades, we have seen the success of Machine Learning, and it is no surprise that we are currently living in the "Image Understanding as statistical inference" era.  While a single 256x256 grayscale image might have been okay to use in the 1960s, today's computer vision researchers use powerful computer clusters and do serious heavy-lifting on millions of real-world megapixel images.  The man-made blocks-world of the 1960s is a thing of the past, and the variety found in random images downloaded from Flickr is the complexity we must now cope with.

While the style of computer vision research has shifted since its early days in the 1960s and 1970s, many old ideas (perhaps prematurely considered outdated) are making a comeback!

Assigning basic-level object category labels to pixels is a very popular theme in vision.  Unfortunately, to gain a deeper understanding of an image, robots will inevitably have to go beyond pixel-level class labels.  (This is one of the central themes in my thesis -- coming out soon!)  Given human-level understanding of a scene, it is trivial to represent it as a pixel-wise labeling map, but given a pixel-wise labeling map it is not trivial to convert it to human-level understanding. 

What sort of questions can be answered about a scene when the output of an "image understanding" system is represented as a pixel-wise label map?

1. Is there a car in the image?
2. Is there a person at this location in the image?

What questions cannot be answered given a pixel-wise label map?

1. How many cars are in this image? (While there are some approaches that strive to deal with delineating object instance boundaries, most image parsing approaches fail to recognize boundaries between two instances of the same category)
2. Which surfaces can I sit on?
3. Where can I park my car?
4. How geometrically stable are the objects in the scene?
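
To make the two lists above concrete, here is a minimal Python sketch on a toy label map. Everything in it is an assumption made for illustration: the label ids, the tiny hand-written grid, and the helper names `contains`, `label_at`, and `count_blobs` do not come from any particular image parsing system. It shows that presence and per-location queries are one-liners, while counting instances by grouping connected pixels of the same label merges two touching cars into one.

```python
from collections import deque

# Hypothetical label ids for this toy example.
BACKGROUND, CAR, PERSON = 0, 1, 2

# Toy 4x8 label map: two cars parked bumper to bumper
# (columns 1-3 and columns 4-6) and one person in the corner.
label_map = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [2, 0, 0, 0, 0, 0, 0, 0],
]


def contains(label_map, label):
    """Answerable: is there at least one pixel with this label?"""
    return any(label in row for row in label_map)


def label_at(label_map, row, col):
    """Answerable: which category is at this pixel?"""
    return label_map[row][col]


def count_blobs(label_map, label):
    """Count 4-connected regions of `label`.

    This is only a proxy for the number of instances: adjacent
    instances of the same category collapse into a single region.
    """
    rows, cols = len(label_map), len(label_map[0])
    seen = set()
    blobs = 0
    for r in range(rows):
        for c in range(cols):
            if label_map[r][c] != label or (r, c) in seen:
                continue
            blobs += 1
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and label_map[ny][nx] == label):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
    return blobs


print(contains(label_map, CAR))              # True
print(label_at(label_map, 3, 0) == PERSON)   # True
print(count_blobs(label_map, CAR))           # 1, although the scene holds two cars
```

The label map simply does not store where one car ends and the next begins, so no post-processing of the labels alone can recover the count in general. The remaining questions (sittable surfaces, parking spots, stability) need geometry and physics that are likewise absent from the map.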

While I have more criticisms than tentative solutions, I believe that vision students shouldn't be parochially preoccupied with solely the most recent approach to image understanding.  It is valuable to go back several decades in the literature and gain a broader perspective on image understanding.  However, some progress is being made!  A deeply insightful upcoming paper from ECCV 2010 is the following:

Abhinav Gupta, Alexei A. Efros and Martial Hebert, Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics, European Conference on Computer Vision, 2010. (PDF)

What Abhinav Gupta does very elegantly in this paper is connect the blocks-world research of the 1960s with the geometric-class estimation problem, as introduced by Derek Hoiem.  While the final system is evaluated on a Hoiem-like pixel-wise labeling task, the actual scene representation is 3D.  The blocks in this approach are more abstract than the Lego-like volumes of the 1960s -- Abhinav's blocks are actually cars, buildings, and trees.  I included the famous Immanuel Kant quote because I feel it describes Abhinav's work very well.  Abhinav introduces the block as a theoretical construct which glues together a scene's elements and provides a much more solid interpretation -- Abhinav's blocks add the content to geometric image understanding which is lacking in the purely pixel-wise approaches.

While integrating large-scale categorization into this type of geometric reasoning is still an open problem, Abhinav provides us visionaries with a glimpse of what image understanding should be.  The integration of robotics with image understanding technology will surely drive pixel-based "dumb" image understanding approaches to extinction.


  1. I see my photo is useful. =)

    About the "Holy Grail" problem -- do you know the correct attribution? Which book/paper can one refer to?

  2. Your image was useful! I found the pic via Google image search, and it seemed easier than pasting one from a paper. I added a link to your blog underneath the image.

    I don't think there's a single initial paper that refers to image understanding as the "Holy Grail" problem -- it is a phrase I've picked up from my advisor.

  3. Anonymous, 10:20 AM

    I like your blog a lot but I honestly find the color scheme to be painful for my eyes. Perhaps it's the white letters on black background. I'd much prefer black letters on white background.

  4. Anonymous, 10:42 AM

    I am a computer vision scientist and liked your blog. You mentioned that it was easier for you to get the image from Google than from a blog. I agree with you. However, not all images are there if they are part of a PDF. I have found the following tool useful. Just check "Extract Images from PDF"

  5. @Tomasz
    I did not mean the name but the problem itself. Who stated it first? Probably, Marr or Minsky.

  6. Minsky I would vote -- much earlier than Marr.

  7. Anonymous, 3:58 AM

    Interesting post and pictures are useful too.
    While doing my research I found an amazing free-to-download book about computer vision.
    This book presents research trends in computer vision, especially applications in robotics and advanced approaches to computer vision (such as omnidirectional vision).
    The contents of this book let the reader learn more about the technical aspects and applications of computer vision.
    The intended audience is anyone who wishes to become familiar with the latest research work on computer vision, especially its applications to robots. This book features representative work on computer vision, with more focus on robotic vision and omnidirectional vision.
    This is the link where you can find it: