Finding Things: Image Parsing with Regions and Per-Exemplar Detectors. Joseph Tighe and Svetlana Lazebnik, CVPR 2013
Idea #1: "Segmentation-driven" Image Parsing
The idea of using bottom-up segmentation to parse scenes is not new. Superpixels (small segments likely to contain a single object category) coupled with some machine learning can be used to produce a coherent scene parsing system; however, the resulting object boundaries are not as precise as one would expect. This shortcoming stems from the smoothing terms used in random field inference and from the fact that generic category-level classifiers have a hard time reasoning about the extent of an object. To see how superpixel-based scene parsing works, check out the video from their earlier ECCV 2010 paper:
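To make the pipeline concrete, here is a minimal sketch of superpixel-based parsing: oversegment the image, score each superpixel with a category classifier, and label all of its pixels with the winning class. The `score_fn` here is a hypothetical stand-in for the learned category-level likelihoods, not the paper's actual classifier.

```python
import numpy as np
from skimage.segmentation import slic

def parse_with_superpixels(image, score_fn, n_segments=200):
    """Assign one category label per superpixel.

    image:    HxWx3 float array in [0, 1]
    score_fn: maps a feature vector to per-class scores
              (placeholder for the system's learned likelihoods)
    """
    # Oversegment into superpixels: small regions that are
    # likely to cover a single object category.
    segments = slic(image, n_segments=n_segments, compactness=10.0)

    labels = np.zeros(image.shape[:2], dtype=int)
    for sp in np.unique(segments):
        mask = segments == sp
        # Toy appearance feature: mean color of the superpixel.
        feature = image[mask].mean(axis=0)
        # Label every pixel in the superpixel with the best class.
        labels[mask] = int(np.argmax(score_fn(feature)))
    return labels

# Toy usage: two "classes" separated by overall brightness.
rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
toy_score_fn = lambda f: np.array([1.0 - f.mean(), f.mean()])
labeling = parse_with_superpixels(img, toy_score_fn)
```

Note that the real system also applies random field smoothing on top of these per-superpixel scores, which is exactly where the imprecise boundaries mentioned above come from.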
Idea #2: Per-exemplar segmentation mask transfer
For me, the most exciting thing about this paper is the integration of segmentation mask transfer from exemplar-based detections. The idea is quite simple: each detector is exemplar-specific and is thus equipped with its own (precise) segmentation mask. When you produce detections from such exemplar-based systems, you can immediately transfer segmentations in a purely top-down manner. This is what I have been trying to get people excited about for years! Congratulations to Joseph Tighe for incorporating these ideas into a full-blown image interpretation system. To see an example of mask transfer, check out the figure below.
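The top-down transfer step is simple enough to sketch in a few lines. Assuming each detection carries its matched exemplar's binary mask and a bounding box in the test image, you paste the (resized) mask into the box, weighted by detection score. The names here (`Detection`, `transfer_masks`) are illustrative, not taken from the paper's code.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple                  # (row0, col0, row1, col1) in the test image
    score: float                # detector confidence
    exemplar_mask: np.ndarray   # binary mask from the matched exemplar

def transfer_masks(detections, image_shape):
    """Accumulate per-pixel object evidence by pasting each exemplar's
    mask into its detection box, weighted by detection score."""
    evidence = np.zeros(image_shape, dtype=float)
    for det in detections:
        r0, c0, r1, c1 = det.box
        h, w = r1 - r0, c1 - c0
        # Resize the exemplar mask to the detection box using
        # nearest-neighbor sampling (keeps the mask binary).
        rows = np.arange(h) * det.exemplar_mask.shape[0] // h
        cols = np.arange(w) * det.exemplar_mask.shape[1] // w
        resized = det.exemplar_mask[np.ix_(rows, cols)]
        evidence[r0:r1, c0:c1] += det.score * resized
    return evidence

# Toy usage: one detection transferring a circular exemplar mask.
yy, xx = np.mgrid[:50, :50]
circle = ((yy - 25) ** 2 + (xx - 25) ** 2 < 400).astype(float)
det = Detection(box=(10, 20, 60, 80), score=0.9, exemplar_mask=circle)
ev = transfer_masks([det], image_shape=(128, 128))
```

Because the mask comes from a single matched exemplar rather than a category average, its boundary can be much sharper than anything a generic category-level classifier would produce.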
Their system produces a per-pixel labeling of the input image, and as you can see below, the results are quite good. Here are some more outputs of their system compared to purely region-based and purely detector-based baselines. Per-exemplar detectors clearly complement superpixel-based "segmentation-driven" approaches.
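The complementarity shows up when you fuse the two streams of evidence. The paper combines the region-based and detector-based data terms with a learned SVM and smooths the result with an MRF; the sketch below substitutes a fixed weighted sum for that learned combination, just to show the shape of the fusion step.

```python
import numpy as np

def fuse_scores(region_scores, detector_scores, w=0.5):
    """region_scores, detector_scores: HxWxC per-class score volumes.
    Returns an HxW labeling from the combined evidence.
    (A fixed weight w stands in for the paper's learned SVM combination.)"""
    combined = (1.0 - w) * region_scores + w * detector_scores
    return combined.argmax(axis=-1)

# Toy usage with random score volumes for a 3-class problem.
rng = np.random.default_rng(1)
labels = fuse_scores(rng.random((64, 64, 3)), rng.random((64, 64, 3)))
```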
This paper will be presented as an oral in session Orals 3C, "Context and Scenes," on Thursday, June 27th at CVPR 2013 in Portland, Oregon.