Showing posts with label image interpretation. Show all posts

Friday, June 21, 2013

[Awesome@CVPR2013] Image Parsing with Regions and Per-Exemplar Detectors

I've been making an inventory of all the awesome papers at CVPR 2013, and one that clearly stood out was Tighe & Lazebnik's paper, "Finding Things: Image Parsing with Regions and Per-Exemplar Detectors."


This paper combines ideas from segmentation-based "scene parsing" (see the video below for the output of their older ECCV2010 SuperParsing system) with per-exemplar detectors (see my Exemplar-SVM paper, as well as my older Recognition by Association paper).  I have worked and published in both of these lines of research, so when I tell you that this paper is worth reading, you should at least take a look.  Below I outline the two ideas being synthesized in this paper, but for the full details you should read their paper (PDF link).  See the overview figure below:


Idea #1: "Segmentation-driven" Image Parsing
The idea of using bottom-up segmentation to parse scenes is not new.  Superpixels (very small segments which are likely to contain a single object category) coupled with some machine learning can be used to produce a coherent scene parsing system; however, the boundaries of objects are not as precise as one would expect.  This shortcoming stems both from the smoothing terms used in random field inference and from the difficulty generic category-level classifiers have in reasoning about the extent of an object.  To see how superpixel-based scene parsing works, check out the video from their older ECCV2010 paper:
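To make the role of the smoothing term concrete, here is a toy sketch in plain Python. It is not the SuperParsing implementation: the function name `parse_superpixels`, the score values, and the simple greedy update (an iterated-conditional-modes style pass with a Potts-like agreement bonus) are all assumptions chosen for illustration. Each superpixel starts with the label its classifier prefers, and is then nudged toward agreeing with its neighbors.

```python
def parse_superpixels(unary, edges, smoothing=0.5, sweeps=5):
    """Label superpixels from classifier scores plus a neighbor-agreement bonus.

    unary: list of dicts, one per superpixel, mapping label -> score (higher is better).
    edges: list of (i, j) pairs of adjacent superpixels.
    smoothing: bonus added for each neighbor that currently shares the label.
    This is a hypothetical toy, not the inference used in any published system.
    """
    # Start from the purely local decision: best-scoring label per superpixel.
    labels = [max(scores, key=scores.get) for scores in unary]

    neighbors = {i: [] for i in range(len(unary))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    for _ in range(sweeps):
        changed = False
        for i, scores in enumerate(unary):
            def total(label):
                agree = sum(1 for j in neighbors[i] if labels[j] == label)
                return scores[label] + smoothing * agree

            best = max(scores, key=total)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels


# Four superpixels in a row; the third one weakly prefers "tree".
unary = [
    {"sky": 0.9, "tree": 0.1},
    {"sky": 0.8, "tree": 0.2},
    {"sky": 0.45, "tree": 0.55},
    {"sky": 0.7, "tree": 0.3},
]
edges = [(0, 1), (1, 2), (2, 3)]

print(parse_superpixels(unary, edges, smoothing=0.0))  # ['sky', 'sky', 'tree', 'sky']
print(parse_superpixels(unary, edges, smoothing=0.5))  # ['sky', 'sky', 'sky', 'sky']
```

With smoothing turned on, the weakly supported "tree" superpixel is absorbed by its "sky" neighbors. That is the effect described above: smoothing buys coherence at the cost of small or thin structures and precise boundaries.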


Idea #2: Per-exemplar segmentation mask transfer
For me, the most exciting thing about this paper is the integration of segmentation mask transfer from exemplar-based detections.  The idea is quite simple: each detector is exemplar-specific and is thus equipped with its own (precise) segmentation mask.  When you produce detections from such exemplar-based systems, you can immediately transfer segmentations in a purely top-down manner.  This is what I have been trying to get people excited about for years!  Congratulations to Joseph Tighe for incorporating these ideas into a full-blown image interpretation system.  To see an example of mask transfer, check out the figure below.
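As a rough illustration of what "transferring" a mask means, here is a minimal sketch in plain Python. It assumes the simplest possible scheme: the exemplar's binary mask is stretched to the detection's bounding box with nearest-neighbor sampling and pasted into an otherwise empty score map. The function name `transfer_mask`, the box convention, and the use of the detection score as the pasted value are assumptions for this sketch, not details taken from the paper.

```python
def transfer_mask(exemplar_mask, box, image_size, score=1.0):
    """Paste an exemplar's binary mask into a detection box.

    exemplar_mask: 2D list of 0/1 values (rows of the exemplar's segmentation mask).
    box: (x0, y0, x1, y1) detection box in the test image, x1 and y1 exclusive.
    image_size: (width, height) of the test image.
    score: value written where the stretched mask is on (hypothetical choice).
    Returns a height x width list of per-pixel scores for the exemplar's label.
    """
    width, height = image_size
    x0, y0, x1, y1 = box
    mask_h, mask_w = len(exemplar_mask), len(exemplar_mask[0])

    out = [[0.0] * width for _ in range(height)]
    # Clip the box to the image, then look up the matching mask cell per pixel.
    for y in range(max(y0, 0), min(y1, height)):
        src_y = (y - y0) * mask_h // (y1 - y0)
        for x in range(max(x0, 0), min(x1, width)):
            src_x = (x - x0) * mask_w // (x1 - x0)
            if exemplar_mask[src_y][src_x]:
                out[y][x] = score
    return out


# A 2x2 exemplar mask whose top-left corner is background.
mask = [[0, 1],
        [1, 1]]
score_map = transfer_mask(mask, box=(1, 1, 5, 5), image_size=(6, 6), score=0.8)
for row in score_map:
    print(row)
```

The point is that the shape information comes entirely from the exemplar: no bottom-up segmentation of the test image is needed to get an object-shaped region out of a detection.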


Their system produces a per-pixel labeling of the input image, and as you can see below, the results are quite good.  Here are some more outputs of their system as compared to solely region-based as well as solely detector-based systems.  Using per-exemplar detectors clearly complements superpixel-based "segmentation-driven" approaches.
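To see why the two sources of evidence complement each other, consider a toy per-pixel combination in plain Python. This is only a sketch under loud assumptions: the function name `label_pixels`, the hand-picked weights, and the plain weighted sum are placeholders for illustration, and this post does not describe how the paper actually combines the two terms.

```python
def label_pixels(region_scores, detector_scores, w_region=1.0, w_detector=1.0):
    """Pick a label per pixel from two sets of per-pixel score maps.

    region_scores, detector_scores: dicts mapping label -> 2D list of scores.
    A label missing from one source contributes 0 from that source.
    The weighted sum is a hypothetical stand-in for a learned combination.
    """
    labels = sorted(set(region_scores) | set(detector_scores))
    any_map = next(iter(region_scores.values()))
    height, width = len(any_map), len(any_map[0])

    def score(source, label, y, x):
        return source[label][y][x] if label in source else 0.0

    result = []
    for y in range(height):
        row = []
        for x in range(width):
            best = max(
                labels,
                key=lambda label: w_region * score(region_scores, label, y, x)
                + w_detector * score(detector_scores, label, y, x),
            )
            row.append(best)
        result.append(row)
    return result


# A 1x3 image: region evidence favors "road" everywhere,
# while a transferred detector mask fires on the middle pixel.
region = {"road": [[0.6, 0.6, 0.6]], "car": [[0.3, 0.3, 0.3]]}
detector = {"car": [[0.0, 0.9, 0.0]]}

print(label_pixels(region, detector))                  # [['road', 'car', 'road']]
print(label_pixels(region, detector, w_detector=0.0))  # [['road', 'road', 'road']]
```

Region evidence alone paints the whole strip as "road"; adding the detector's mask recovers the "car" pixel without disturbing the background.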



This paper will be presented as an oral in the Orals 3C session called "Context and Scenes" to be held on Thursday, June 27th at CVPR 2013 in Portland, Oregon.

Monday, January 18, 2010

Understanding versus Interpretation -- a philosophical distinction

Today I want to bring up an interesting discussion regarding the connotation of the word "understanding" versus "interpretation," particularly in the context of "scene understanding" versus "scene interpretation." While many vision researchers use these terms interchangeably, I think it is worthwhile to make the distinction, albeit a philosophical one.

On Understanding
While everybody knows that the goal of computer vision is to recognize all of the objects in an image, there is plenty of disagreement about how to represent objects and recognize them in the image. There is a physicalist account (from Wikipedia: Physicalism is a philosophical position holding that everything which exists is no more extensive than its physical properties), where the goal of vision is to reconstruct veridical properties of the world. This view is consistent with the realist stance in philosophy (think back to Philosophy 101) -- there exists a single observer-independent 'ground-truth' regarding the identities of all of the objects contained in the world. The notion of vision as measurement is very strong under this physicalist account. The stuff of the world is out there just waiting to be grasped! I think the term "understanding" fits very well into this truth-driven account of computer vision.

On Interpretation
The second view, a postmodern and anti-realist one, is of vision as a way of interpreting scenes. The shift is from veridical recovery of the properties of the world from an image (measurement) to the observer-dependent interpretation of the input stimulus. Under this account, there is no need to believe in a god's eye 'objective' view of the world. Image interpretation is the registration of an input image with a vast network of past experience, both visual and abstract. The same person can vary their own interpretation of an input as time passes and their internal knowledge base evolves. Under this view, two distinct robots could provide very useful yet distinct 'image interpretations' of the same input image. The main idea is that different robots could have different interpretation-spaces; that is, they could obtain incommensurable (yet very useful!) interpretations of the same image.

It has been argued by Donald Hoffman (Interface Theory of Perception) that there is no reason why we should expect evolution to have driven humans towards veridical perception. In fact, Hoffman argues that nature drives veridical perception towards extinction and that it only makes sense to speak of perception as guiding agents towards pragmatic interpretations of their environment.

In philosophy of science, there is a debate over whether physics is unraveling some ultimate truth about the world or painting a coherent and pragmatic picture of it. I've always viewed science as an art and I embrace my anti-realist stance -- which has been shaped by Thomas Kuhn, William James, and many others. While my scientific interests have currently congealed in computer vision, it is no surprise that I'm finding conceptual agreement between my philosophy of science and my concrete research efforts in object recognition.