Friday, June 21, 2013

[Awesome@CVPR2013] Image Parsing with Regions and Per-Exemplar Detectors

I've been making an inventory of all the awesome papers at this year's CVPR 2013 conference, and one that clearly stood out was Tighe & Lazebnik's paper, "Image Parsing with Regions and Per-Exemplar Detectors."

This paper combines ideas from segmentation-based "scene parsing" (see the video below for the output of their older ECCV 2010 SuperParsing system) with per-exemplar detectors (see my Exemplar-SVM paper, as well as my older Recognition by Association paper).  I have worked and published in both of these lines of research, so when I tell you that this paper is worth reading, you should at least take a look.  Below I outline the two ideas that are synthesized in this paper; for the full details, you should read their paper (PDF link).  See the overview figure below:

Idea #1: "Segmentation-driven" Image Parsing
The idea of using bottom-up segmentation to parse scenes is not new.  Superpixels (very small segments which are likely to contain a single object category) coupled with some machine learning can be used to produce a coherent scene parsing system; however, the boundaries of objects are not as precise as one would expect.  This shortcoming stems from the smoothing terms used in random field inference and from the fact that generic category-level classifiers have a hard time reasoning about the extent of an object.  To see how superpixel-based scene parsing works, check out the video from their older ECCV 2010 paper:
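As a rough illustration of the labeling step described above (this is not the authors' implementation), here is a minimal Python sketch that turns per-superpixel class scores into a per-pixel labeling.  The function name, the array layout, and the toy scores are all my own assumptions, and the sketch deliberately leaves out the random field smoothing, so each superpixel is labeled independently.

```python
import numpy as np

def label_pixels_from_superpixels(superpixel_map, superpixel_scores):
    """Assign every pixel the best-scoring class of its superpixel.

    superpixel_map:    (H, W) integer array of superpixel ids in [0, S).
    superpixel_scores: (S, C) array of classifier scores, one row per superpixel.
    Returns an (H, W) array of class indices.

    Hypothetical helper for illustration; no random field smoothing is applied.
    """
    # Best class for each superpixel, shape (S,).
    superpixel_labels = np.argmax(superpixel_scores, axis=1)
    # Look up each pixel's superpixel id to get its class label.
    return superpixel_labels[superpixel_map]

# Toy example: a 3x3 image split into 3 superpixels, with 2 classes.
superpixel_map = np.array([[0, 0, 1],
                           [0, 2, 1],
                           [2, 2, 1]])
superpixel_scores = np.array([[0.9, 0.1],   # superpixel 0 prefers class 0
                              [0.2, 0.8],   # superpixel 1 prefers class 1
                              [0.6, 0.4]])  # superpixel 2 prefers class 0
pixel_labels = label_pixels_from_superpixels(superpixel_map, superpixel_scores)
```

Because every pixel inherits its label from its superpixel, object boundaries can only be as good as the superpixel boundaries and the classifier scores behind them.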

Idea #2: Per-exemplar segmentation mask transfer
For me, the most exciting thing about this paper is the integration of segmentation mask transfer from exemplar-based detections.  The idea is quite simple: each detector is exemplar-specific and is thus equipped with its own (precise) segmentation mask.  When you produce detections from such exemplar-based systems, you can immediately transfer segmentations in a purely top-down manner.  This is what I have been trying to get people excited about for years!  Congratulations to Joseph Tighe for incorporating these ideas into a full-blown image interpretation system.  To see an example of mask transfer, check out the figure below.
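To make the mask transfer idea concrete, here is a minimal Python sketch under my own assumptions (it is not the code from the paper): the exemplar's binary mask is resized to the detection box with nearest-neighbor sampling and pasted into an image-sized map.  The function name, the box convention, and the choice to weight the mask by the detection score are all hypothetical.

```python
import numpy as np

def transfer_mask(exemplar_mask, box, image_shape, score=1.0):
    """Paste an exemplar's segmentation mask into a detection box.

    exemplar_mask: (h, w) boolean mask stored with the exemplar.
    box:           (x0, y0, x1, y1) detection box in pixels, with x1 and y1
                   exclusive; assumed to lie fully inside the image.
    image_shape:   (H, W) of the test image.
    score:         detection score used to weight the mask (an assumption).
    Returns an (H, W) float map that is nonzero only on the transferred mask.
    """
    x0, y0, x1, y1 = box
    out_h, out_w = y1 - y0, x1 - x0
    mask_h, mask_w = exemplar_mask.shape
    # Nearest-neighbor resize: map each output row/column to a source index.
    rows = np.arange(out_h) * mask_h // out_h
    cols = np.arange(out_w) * mask_w // out_w
    resized = exemplar_mask[np.ix_(rows, cols)]
    # Paste the weighted mask into an otherwise empty image-sized map.
    score_map = np.zeros(image_shape, dtype=float)
    score_map[y0:y1, x0:x1] = resized * score
    return score_map

# Toy example: a 2x2 exemplar mask transferred to a 4x4 detection box.
exemplar_mask = np.array([[True, False],
                          [True, True]])
score_map = transfer_mask(exemplar_mask, box=(1, 1, 5, 5),
                          image_shape=(6, 6), score=0.5)
```

Maps like this from several detections of the same class could then be summed to give a per-pixel, per-class detector score, which is one plausible way to feed top-down evidence into a pixel labeling.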

Their system produces a per-pixel labeling of the input image, and as you can see below, the results are quite good.  Here are some more outputs of their system, compared with purely region-based and purely detector-based systems.  Per-exemplar detectors clearly complement superpixel-based "segmentation-driven" approaches.

This paper will be presented as an oral in the Orals 3C session called "Context and Scenes" to be held on Thursday, June 27th at CVPR 2013 in Portland, Oregon.