Wednesday, August 25, 2010

Multifaceted Knowledge Representation: Ideas from Marvin Minsky

"I think a key to AI is the need for several representations of the knowledge, such that when the system is stuck (using one representation) it can jump to use another. When David Marr at MIT moved into computer vision, he generated a lot of excitement, but he hit up against the problem of knowledge representation; he had no good representations for knowledge in his vision systems." -- Marvin Minsky

Check out the full interview with Marvin Minsky here -- a must read for anybody serious about building intelligent machines!  This interview appears to be a part of a larger volume: Hal's Legacy.

I believe that in order to make the enterprise of computer vision a success, we must seriously broaden our outlook on the problem. Are we seriously expecting algorithms to delineate object boundaries in real images based on statistics of patch descriptors, without any sort of model of the world?

I don't know about you, but I seriously want to build intelligent machines. I don't think there will ever be any sort of low-level SIFT-esque algorithm that "solves vision." It is a much grander picture of intelligence that I'm really after -- and successful computer vision will be a result (or a component?) of a higher-level intelligent machine. Machines need to know about a whole lot more than is found in a single image -- and the necessary conceptual tools might not be present in the computer vision community.

A recurring theme in my blog is my belief that we must become renaissance men -- a unison of *nix hackers, vision scientists, cognitive scientists, philosophers, athletes, machine learning scientists, skilled orators, and much more -- if we are to have any hope of chiseling away at the problem of computational intelligence.  Minsky was a pioneer of computational intelligence, and his words revitalize my own research efforts in this direction.

Monday, August 23, 2010

Beyond pixel-wise labeling: Blocks World Revisited

"Thoughts without content are empty, intuitions without concepts are blind." -- Immanuel Kant 

The Holy Grail problem of computer vision research is general-purpose image understanding.  Given as input a digital image (perhaps from Flickr or from Google Image search), we want to recognize the depicted objects (cars, dogs, sheep, Macbook Pros), their functional properties (which of the depicted objects are suitable for sitting), and recover the underlying geometry and spatial relations (which objects are lying on the desk). 

The early days of vision were dominated by the "Image Understanding as Inverse Optics" mentality. In order to make the problem easier, as well as to cope with the meager computational resources of the 60s, early computer vision researchers tried to recover the 3D geometry of simple scenes consisting of arrangements of blocks. One of the earlier efforts in this direction is the PhD thesis Machine Perception of Three-Dimensional Solids by Larry Roberts from MIT back in 1963.

But wait -- these blocks worlds are unlike anything found in the real world! The drastic divide between the imagery that vision researchers were studying in the 60s and what humans observe during their daily experiences ultimately led to the disappearance of blocks worlds from computer vision research.

Figure: Image parsing concept image, from the Computer Blindness blog

Over the past couple of decades, we have seen the success of Machine Learning, and it is no surprise that we are currently living in the "Image Understanding as statistical inference" era. While a single 256x256 grayscale image might have been okay to use in the 1960s, today's computer vision researchers use powerful computer clusters and do serious heavy-lifting on millions of real-world megapixel images. The man-made blocks world of the 1960s is a thing of the past, and the variety found in random images downloaded from Flickr is the complexity we must now cope with.

While the style of computer vision research has shifted since its early days in the 1960s/1970s, many old ideas (perhaps prematurely considered outdated) are making a comeback!

Assigning basic-level object category labels to pixels is a very popular theme in vision.  Unfortunately, to gain a deeper understanding of an image, robots will inevitably have to go beyond pixel-level class labels.  (This is one of the central themes in my thesis -- coming out soon!)  Given human-level understanding of a scene, it is trivial to represent it as a pixel-wise labeling map, but given a pixel-wise labeling map it is not trivial to convert it to human-level understanding. 
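To make the asymmetry concrete, here is a minimal sketch of the "easy" direction. Everything in it is a toy assumption of mine (the SceneObject record, the axis-aligned boxes, the rasterize helper); no real system represents scenes this crudely. It only shows that flattening an object-level description into a label map takes a few lines, and that two different scenes can flatten to exactly the same map.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    """Hypothetical object-level description of one thing in the scene."""
    category: str
    box: tuple  # (row0, col0, row1, col1), end-exclusive
    sittable: bool = False


def rasterize(objects, height, width, background="background"):
    """Flatten object-level descriptions into a pixel-wise label map."""
    label_map = [[background] * width for _ in range(height)]
    for obj in objects:
        r0, c0, r1, c1 = obj.box
        for r in range(r0, r1):
            for c in range(c0, c1):
                label_map[r][c] = obj.category
    return label_map


# Scene A: two cars parked bumper to bumper, plus a bench.
scene_a = [
    SceneObject("car", (1, 0, 3, 3)),
    SceneObject("car", (1, 3, 3, 6)),
    SceneObject("bench", (0, 7, 1, 9), sittable=True),
]

# Scene B: one long car covering the same pixels, plus the same bench.
scene_b = [
    SceneObject("car", (1, 0, 3, 6)),
    SceneObject("bench", (0, 7, 1, 9), sittable=True),
]

map_a = rasterize(scene_a, height=4, width=10)
map_b = rasterize(scene_b, height=4, width=10)

# Both scenes collapse to the same label map: the instance boundaries and
# the "sittable" property are gone, so the map cannot be inverted.
same_map = map_a == map_b
```

Going from the map back to either scene would require deciding how many cars there are and what each object affords, which is exactly the knowledge the map threw away.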

What sort of questions can be answered about a scene when the output of an "image understanding" system is represented as a pixel-wise label map?

1. Is there a car in the image?
2. Is there a person at this location in the image?

What questions cannot be answered given a pixel-wise label map?

1. How many cars are in this image? (While there are some approaches that strive to deal with delineating object instance boundaries, most image parsing approaches fail to recognize boundaries between two instances of the same category)
2. Which surfaces can I sit on?
3. Where can I park my car?
4. How geometrically stable are the objects in the scene?
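The two lists above can be illustrated with a toy label map. This is only a sketch under my own assumptions: the tiny hand-made map, the helper names (contains, label_at, count_regions), and the use of 4-connected regions as a stand-in for "counting" are all illustrative, not how any particular image parsing system works.

```python
from collections import deque


def contains(label_map, category):
    """Answerable: is there any pixel of this category in the image?"""
    return any(category in row for row in label_map)


def label_at(label_map, row, col):
    """Answerable: which category is at this location?"""
    return label_map[row][col]


def count_regions(label_map, category):
    """Naive instance count: number of 4-connected regions of a category."""
    height, width = len(label_map), len(label_map[0])
    seen = set()
    regions = 0
    for r in range(height):
        for c in range(width):
            if label_map[r][c] != category or (r, c) in seen:
                continue
            regions += 1
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and (ny, nx) not in seen
                            and label_map[ny][nx] == category):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
    return regions


C, P, B = "car", "person", "background"
# Assumed ground truth: THREE cars. Columns 0-1 and 2-3 are two cars parked
# bumper to bumper; columns 6-7 are a third car.
toy_map = [
    [B, B, B, B, B, B, B, B],
    [C, C, C, C, B, B, C, C],
    [C, C, C, C, B, P, C, C],
    [B, B, B, B, B, P, B, B],
]

has_car = contains(toy_map, "car")           # True
who_is_here = label_at(toy_map, 2, 5)        # "person"
car_regions = count_regions(toy_map, "car")  # 2, although the scene has 3 cars
```

The first two queries come straight out of the map. The count is wrong because the two touching cars merge into one region, and nothing in the map says which surfaces are sittable, where parking is possible, or how stable anything is.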

While I have more criticisms than tentative solutions, I believe that vision students shouldn't be parochially preoccupied with solely the most recent approach to image understanding. It is valuable to go back several decades in the literature and gain a broader perspective on image understanding. However, some progress is being made! A deeply insightful upcoming paper from ECCV 2010 is the following:

Abhinav Gupta, Alexei A. Efros and Martial Hebert, Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics, European Conference on Computer Vision, 2010. (PDF)

What Abhinav Gupta does very elegantly in this paper is connect the blocks-world research of the 1960s with the geometric-class estimation problem, as introduced by Derek Hoiem. While the final system is evaluated on a Hoiem-like pixel-wise labeling task, the actual scene representation is 3D. The blocks in this approach are more abstract than the Lego-like volumes of the 1960s -- Abhinav's blocks are actually cars, buildings, and trees. I included the famous Immanuel Kant quote because I feel it describes Abhinav's work very well. Abhinav introduces the block as a theoretical construct which glues together a scene's elements and provides a much more solid interpretation -- Abhinav's blocks add the content to geometric image understanding which is lacking in the purely pixel-wise approaches.

While integrating large-scale categorization into this type of geometric reasoning is still an open problem, Abhinav provides us visionaries with a glimpse of what image understanding should be.  The integration of robotics with image understanding technology will surely drive pixel-based "dumb" image understanding approaches to extinction.