Friday, June 21, 2013

[Awesome@CVPR2013] Image Parsing with Regions and Per-Exemplar Detectors

I've been making an inventory of all the awesome papers at this year's CVPR 2013 conference, and one which clearly stood out was Tighe & Lazebnik's "Image Parsing with Regions and Per-Exemplar Detectors."

This paper combines ideas from segmentation-based "scene parsing" (see the below video for the output of their older ECCV2010 SuperParsing system) with per-exemplar detectors (see my Exemplar-SVM paper, as well as my older Recognition by Association paper).  I have worked and published in both of these lines of research, so when I tell you that this paper is worth reading, you should at least take a look.  Below I outline the two ideas being synthesized in this paper, but for full details you should read their paper (PDF link).  See the overview figure below:

Idea #1: "Segmentation-driven" Image Parsing
The idea of using bottom-up segmentation to parse scenes is not new.  Superpixels (very small segments which are likely to contain a single object category) coupled with some machine learning can be used to produce a coherent scene parsing system; however, the boundaries of objects are not as precise as one would expect.  This shortcoming stems from the smoothing terms used in random field inference and because generic category-level classifiers have a hard time reasoning about the extent of an object.  To see how superpixel-based scene parsing works, check out the video from their older paper from ECCV2010:
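To make the superpixel-parsing idea concrete, here is a minimal sketch (not the SuperParsing implementation, and all numbers are invented): each superpixel gets per-class unary scores from some classifier, and a greedy ICM-style pass then smooths labels over the superpixel adjacency graph with a Potts penalty, which is roughly the kind of random-field smoothing described above.

```python
import numpy as np

# Hypothetical toy setup: 4 superpixels on a chain, 2 classes (say "sky", "road").
# Unary scores would normally come from a classifier over superpixel features;
# the values here are made up for illustration.
unary = np.array([
    [2.0, 0.1],   # superpixel 0: strongly class 0
    [1.5, 0.4],   # superpixel 1
    [0.5, 0.6],   # superpixel 2: ambiguous, slightly prefers class 1
    [1.8, 0.2],   # superpixel 3: strongly class 0
])
edges = [(0, 1), (1, 2), (2, 3)]  # adjacency between neighboring superpixels
smooth = 0.5                      # Potts penalty for disagreeing neighbors

def parse(unary, edges, smooth, n_iters=10):
    """Greedy ICM-style inference: start from the unary argmax, then
    repeatedly move each superpixel to the label maximizing its unary
    score minus the disagreement penalty with its current neighbors."""
    labels = unary.argmax(axis=1)
    nbrs = {i: [] for i in range(len(unary))}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    for _ in range(n_iters):
        changed = False
        for i in range(len(unary)):
            best = max(range(unary.shape[1]),
                       key=lambda c: unary[i, c]
                       - smooth * sum(labels[j] != c for j in nbrs[i]))
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels

print(parse(unary, edges, smooth))  # smoothing flips the ambiguous superpixel
```

Note how superpixel 2, which the unary term alone labels as class 1, is pulled to class 0 by its neighbors; this is also why such systems tend to over-smooth object boundaries.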

Idea #2: Per-exemplar segmentation mask transfer
For me, the most exciting thing about this paper is the integration of segmentation mask transfer from exemplar-based detections.  The idea is quite simple: each detector is exemplar-specific and is thus equipped with its own (precise) segmentation mask.  When you produce detections from such exemplar-based systems, you can immediately transfer segmentations in a purely top-down manner.  This is what I have been trying to get people excited about for years!  Congratulations to Joseph Tighe for incorporating these ideas into a full-blown image interpretation system.  To see an example of mask transfer, check out the figure below.
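The mechanics of top-down mask transfer can be sketched in a few lines (a toy illustration with invented numbers, not the paper's pipeline): each exemplar detection carries its own binary mask, which gets pasted into the image at the detection's location, weighted by the detection score, and per-pixel evidence accumulates across detections.

```python
import numpy as np

H, W = 8, 8
vote_map = np.zeros((H, W))  # per-pixel evidence for one object class

# A hypothetical 3x3 exemplar mask (a rough object silhouette).
exemplar_mask = np.array([[0, 1, 0],
                          [1, 1, 1],
                          [0, 1, 0]], dtype=float)

# Detections: (row, col of the mask's top-left corner, detector score).
detections = [(1, 1, 0.9), (2, 2, 0.6)]

# Transfer: paste each exemplar's mask at its detection, scaled by score.
for r, c, score in detections:
    h, w = exemplar_mask.shape
    vote_map[r:r + h, c:c + w] += score * exemplar_mask

# Pixels covered by both transferred masks accumulate the most evidence.
print(vote_map.max())  # 0.9 + 0.6 = 1.5 at the overlapping pixels
```

Because the mask comes from a single matched exemplar rather than a category-level model, the transferred boundary can be far sharper than what bottom-up superpixels alone provide.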

Their system produces a per-pixel labeling of the input image, and as you can see below, the results are quite good.  Here are some more outputs of their system as compared to solely region-based as well as solely detector-based systems.  Using per-exemplar detectors clearly complements superpixel-based "segmentation-driven" approaches.
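As a rough sketch of how the two sources of evidence might be fused into a per-pixel labeling (the paper combines its data terms inside an MRF with smoothing; here, under that simplification, hypothetical score maps are just summed and argmaxed):

```python
import numpy as np

H, W, n_classes = 4, 4, 3
rng = np.random.default_rng(0)

# Hypothetical per-pixel class scores from each stream.
region_scores = rng.random((H, W, n_classes))    # segmentation-driven term
detector_scores = rng.random((H, W, n_classes))  # per-exemplar mask-transfer term

# Fuse the two data terms and take a per-pixel argmax.
combined = region_scores + detector_scores
labeling = combined.argmax(axis=2)  # H x W map of class indices
print(labeling.shape)
```

Either term alone would produce a plausible labeling; the point of the paper is that the combination outperforms both.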

This paper will be presented as an oral in the Orals 3C session called "Context and Scenes" to be held on Thursday, June 27th at CVPR 2013 in Portland, Oregon.
