Tuesday, October 13, 2009

What is segmentation-driven object recognition?

In this post, I want to discuss what the term "segmentation-driven object recognition" means to me. While segmentation-only and object recognition-only research papers are ubiquitous in vision conferences (such as CVPR, ICCV, and ECCV), a new research direction which uses segmentation for recognition has emerged. Many researchers pushing in this direction are direct descendants of the great J. Malik, such as Belongie, Efros, Mori, and many others. The best example of segmentation-driven recognition can be found in Rabinovich's Objects in Context paper. The basic idea in this paper is to compute multiple stable segmentations of an input image using Ncuts and use a dense probabilistic graphical model over segments (combining local terms and segment-segment context) to recognize objects inside those regions.

Segmentation-only research focuses on the actual image segmentation algorithms -- where the output of a segmentation algorithm is a partition of a 2D image into contiguous regions. Algorithms such as mean-shift and normalized cuts, as well as hundreds of probabilistic graphical models, can be used to produce such segmentations. The Berkeley group (in an attempt to salvage "mid-level" vision) has been working diligently on boundary detection and image segmentation for over a decade.

Recognition-only research generally focuses on new learning techniques or building systems to perform well on detection/classification benchmarks. The sliding window approach coupled with bag-of-words models has dominated vision and is the unofficial method of choice.

It is easy to relax the bag-of-words model, so let's focus on rectangles for a second. If we do not use segmentation, the world of objects will have to conform to sliding rectangles and image parsing will inevitably look like this:

(Taken from Bryan Russell's Object Recognition by Scene Alignment paper).

It has been argued that segmentation is required to move beyond the world of rectangular windows if we are to successfully break up images into their constituent objects. While some objects can be neatly approximated by a rectangle in the 2D image plane, free-form regions must be used to explain away an arbitrary image. I have argued this point extensively in my BMVC 2007 paper, and the interesting result was that multiple segmentations must be used if we want to produce reasonable segments. Sadly, a single segmentation is generally not good enough by itself to produce object-corresponding regions.

(Here is an example of the Mean Shift algorithm where to get a single cow segment two adjacent regions had to be merged.)

The question of how to use segmentation algorithms for recognition is still open. If segmentation could tessellate an image into "good" regions in one shot, then the goal of recognition would be simply to label these regions, and life would be simple. This is unfortunately far from reality. While blobs of homogeneous appearance often correspond to things like sky, grass, and road, many objects do not pop out as a single segment. I have proposed using a soup of such segments that come from different algorithms being run with different parameters (and even merging pairs and triplets of such segments!), but this produces a large number of regions and thus makes the recognition task harder.
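To make the soup idea concrete, here is a minimal sketch, not the code from my paper. It assumes each segmentation is already available as a 2D integer label map in which every label is one connected region, and the names `adjacency` and `segment_soup` are made up for this illustration. It keeps every single segment and adds merged masks for adjacent pairs and connected triplets.

```python
from itertools import combinations

import numpy as np


def adjacency(labels):
    """Return the set of unordered label pairs that touch (4-connectivity)."""
    pairs = set()
    # Compare each pixel with its right neighbour, then with its lower neighbour.
    for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])):
        diff = a != b
        for u, v in zip(a[diff].tolist(), b[diff].tolist()):
            pairs.add((min(u, v), max(u, v)))
    return pairs


def segment_soup(label_maps):
    """Collect boolean masks for single segments, adjacent pairs, and connected triplets."""
    soup = []
    for labels in label_maps:
        ids = np.unique(labels).tolist()  # sorted label ids
        adj = adjacency(labels)
        groups = [(i,) for i in ids]
        groups += sorted(adj)
        # A triplet is connected when at least two of its three pairs are adjacent.
        for trio in combinations(ids, 3):
            links = sum(pair in adj for pair in combinations(trio, 2))
            if links >= 2:
                groups.append(trio)
        for group in groups:
            soup.append(np.isin(labels, group))
    return soup


# Toy example: three vertical stripes give 3 singles + 2 pairs + 1 triplet.
toy = np.array([[0, 0, 1, 2],
                [0, 0, 1, 2]])
print(len(segment_soup([toy])))  # 6
```

Even on this toy image, three segments turn into six candidate regions. The triplet loop is brute force over all label triples, so it is only meant for illustration, and duplicate masks coming from different segmentations are not removed.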

Using a soup of segments, a small fraction of the regions might be of high quality; however, recognition now has to throw away thousands of misleading segments. Abhinav Gupta, a new addition to the CMU vision community, has pointed out that if we want to model context between segments (and for object-object relationships this means a quadratic dependence on the number of segments), using a large soup of segments is simply not tractable. Either the number of segments or the number of context interactions has to be reduced in this case, but non-quadratic object-object context models are an open question.
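To put rough numbers on the quadratic dependence, here is a small sketch that counts the pairwise context terms a fully connected model would need, assuming one term per unordered pair of segments. The function name `pairwise_context_terms` is made up for this illustration.

```python
def pairwise_context_terms(num_segments):
    """Number of unordered segment pairs, i.e. n choose 2."""
    return num_segments * (num_segments - 1) // 2


for n in (10, 100, 1000, 10000):
    print(n, pairwise_context_terms(n))
# 10 45
# 100 4950
# 1000 499500
# 10000 49995000
```

Going from 100 segments to 10,000 multiplies the number of pairwise terms by roughly 10,000, which is why either the soup or the set of interactions has to be pruned.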

In conclusion, the representation used by segmentation (that of free-form regions) is superior to sliding window approaches, which utilize rectangular windows. However, off-the-shelf segmentation algorithms are still lacking with respect to their ability to generate such regions. Why should an algorithm that doesn't know anything about objects be able to segment out objects? I suspect that in the upcoming years we will see a flurry of learning-based segmenters that provide a blend of recognition and bottom-up grouping, and I envision such algorithms being used in a strictly non-feedforward way.


  1. Anonymous 5:35 AM

    Hi, check the winning entry in the segmentation challenge of pascal 2009.
    The segmentation method there produces a small number of very accurate segments, fully bottom-up.

  2. Thanks for pointing out the cool segmentation work -- the University of Bonn approach does seem promising.

  3. Anonymous 4:04 AM

    By the way, regarding "recognition now has to throw away 1000s of misleading segments": object detection algorithms do this routinely with bounding boxes, so why should it be harder with segments?

  4. I think the problem of recognition inside segments is inherently more difficult than recognition inside sliding windows. Much of the ambiguity arises when we treat segment recognition as solving a bunch of independent binary classification problems. Since segments are free-form as opposed to rigid rectangles, their variation is inherently higher dimensional.

    Consider what happens when you stare at a painting such as Van Gogh's Mountains at Saint-Remy -- different brush strokes seem to be combined in a plethora of ways to hallucinate objects in the painting. The longer you stare the more objects you see.

  5. Hi, I would like to ask you about object segmentation and detection.
    We have a graduation project: a system to help blind people.
    While searching for an algorithm for our project, we found "real-time algorithm 100 object recognition system",
    but we would like to know more about segmentation, please.
    Thank you very much...

  6. Anonymous 3:45 PM

    Hi, can you please give me some examples of video datasets that I can use for my project on object detection and recognition in video?
    Please help me.
    Thanks :)

  7. I recommend taking a look at MIT's LabelMe videos.