In his Philosophical Investigations, Wittgenstein argues against abstraction -- via several thought experiments he strives to annihilate the view that, over the course of their lives, humans develop neat and consistent concepts in their minds (akin to building a dictionary). He criticizes the notions of meaning and concept formation that were commonplace in philosophical circles at the time, and his arguments have contributed greatly to my own ideas regarding categorization in computer vision.
Wittgenstein asks the reader to come up with the definition of the concept "game." While we can look up the definition of "game" in a dictionary, we can't help but feel that any definition will be either too narrow or too broad. The number of exceptions we would need in a single definition scales with the number of unique games we've been exposed to. His point wasn't that "game" cannot be defined -- it was that the lack of a formal definition does not prevent us from using the word "game" correctly. Think of a child growing up and being exposed to multi-player games, single-player games, fun games, competitive games, and games that are primarily characterized by their display of athleticism (aka sports or the Olympic Games). Let's not forget activities such as courting and the Stock Market, which are also referred to as "games." Wittgenstein criticizes the idea that during our lives we somehow determine what is common to all of those examples of games and form an abstract concept of game which determines how we categorize novel activities. For Wittgenstein, our concept of game is not much more than our exposure to activities labeled as games and our ability to re-apply the word "game" in future contexts.
Wittgenstein's ideas are the antithesis of Platonic Realism and Aristotle's Classical notion of Categories, where concepts/categories are pure, well-defined, and possess neatly defined boundaries. For Wittgenstein, experience is the anchor which allows us to measure the similarity between a novel activity and past activities referred to as games. Maybe the ineffability of experience isn't because internal concepts are inaccessible to introspection; maybe there is simply no internal library of concepts in the first place.
An experience-based view of concepts (or as my advisor would say, a data-driven theory of concepts) suggests that there is no surrogate for living a life rich with experience. While this has implications for how one should live their own life, it also has implications for the field of artificial intelligence. The modern enterprise of "internet vision," where images are labeled with categories and fed into a classifier, has to be questioned. While I have criticized categories, there are also problems with a purely data-driven, large-database-based approach. It seems that a good place to start is by pruning away redundant bits of information; however, deciding what is redundant and how to prune it is still an open question.
Monday, October 26, 2009
Monday, October 19, 2009
Scene Prototype Models for Indoor Image Recognition
In today's post I want to briefly discuss a computer vision paper which has caught my attention.
In the paper Recognizing Indoor Scenes, Quattoni and Torralba build a scene recognition system for categorizing indoor images. Instead of performing learning directly in descriptor space (such as the GIST over the entire image), the authors use a "distance-space" representation. An image is described by a vector of distances to a large number of scene prototypes. A scene prototype consists of a root feature (the global GIST) as well as features belonging to a small number of regions associated with the prototype. One example of such a prototype might be an office scene with a monitor region in the center of the image and a keyboard region below it -- however the ROIs (which can be thought of as parts of the scene) are often more abstract and do not neatly correspond to a single object.
The learning problem (which is solved once per category) is then to find the internal parameters of each prototype as well as the per-class prototype distance weights which are used for classification. From a distance function learning point of view, it is rather interesting to see distances to many exemplars being used as opposed to the distance to a single focal exemplar.
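As a rough illustration of the distance-space idea, here is a minimal sketch, not the authors' implementation. It assumes each prototype is reduced to a single root descriptor (the region features are ignored), uses plain Euclidean distance, and scores each class as a weighted sum of prototype distances. The names `distance_features` and `class_scores`, the toy 2-D descriptors, and the hand-picked weights are all hypothetical; in the paper the prototype parameters and the weights are learned.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_features(descriptor, prototypes):
    """Describe an image by its distances to every prototype."""
    return [euclidean(descriptor, p) for p in prototypes]

def class_scores(features, weights, biases):
    """Score each class as a weighted sum of prototype distances plus a bias."""
    return {
        label: sum(w * f for w, f in zip(weights[label], features)) + biases[label]
        for label in weights
    }

# Toy 2-D descriptors standing in for global GIST vectors (assumption).
prototypes = [[0.0, 0.0], [3.0, 4.0]]
features = distance_features([0.0, 0.0], prototypes)

# Hand-picked weights (assumption): a negative weight means that being
# close to that prototype raises the class score.
weights = {"office": [-1.0, 0.0], "kitchen": [0.0, -1.0]}
biases = {"office": 0.0, "kitchen": 0.0}

scores = class_scores(features, weights, biases)
predicted = max(scores, key=scores.get)
print(features, predicted)
```

The point of the sketch is only that the classifier never sees the raw descriptor; it sees how far the image is from each prototype, and each class decides which of those distances matter.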
Although the authors report results on the image categorization task, it is worthwhile to ask whether scene prototypes could be used for object localization. While it is easy to play the evil genius and devise an image unique enough that it doesn't conform to any notion of a prototype, I wouldn't be surprised if 80% of the images we encounter on the internet conform to a few hundred scene prototypes. Of course, the problem of learning such prototypes from data without prototype-labeling (which requires expert vision knowledge) is still open. Overall, I like the direction and ideas contained in this research paper, and I'm looking forward to seeing how these ideas develop.
Tuesday, October 13, 2009
What is segmentation-driven object recognition?
In this post, I want to discuss what the term "segmentation-driven object recognition" means to me. While segmentation-only and object recognition-only research papers are ubiquitous in vision conferences (such as CVPR, ICCV, and ECCV), a new research direction which uses segmentation for recognition has emerged. Many researchers pushing in this direction are direct descendants of the great J. Malik, such as Belongie, Efros, Mori, and many others. The best example of segmentation-driven recognition can be found in Rabinovich's Objects in Context paper. The basic idea in this paper is to compute multiple stable segmentations of an input image using Ncuts and use a dense probabilistic graphical model over segments (combining local terms and segment-segment context) to recognize objects inside those regions.
Segmentation-only research focuses on the actual image segmentation algorithms -- where the output of a segmentation algorithm is a partition of a 2D image into contiguous regions. Algorithms such as mean-shift and normalized cuts, as well as hundreds of probabilistic graphical models, can be used to produce such segmentations. The Berkeley group (in an attempt to salvage "mid-level" vision) has been working diligently on boundary detection and image segmentation for over a decade.
Recognition-only research generally focuses on new learning techniques or building systems to perform well on detection/classification benchmarks. The sliding window approach coupled with bag-of-words models has dominated vision and is the unofficial method of choice.
It is easy to relax the bag-of-words model, so let's focus on rectangles for a second. If we do not use segmentation, the world of objects will have to conform to sliding rectangles and image parsing will inevitably look like this:
(Taken from Bryan Russell's Object Recognition by Scene Alignment paper).
It has been argued that segmentation is required to move beyond the world of rectangular windows if we are to successfully break up images into their constituent objects. While some objects can be neatly approximated by a rectangle in the 2D image plane, free-form regions must be used to explain away an arbitrary image. I have argued this point extensively in my BMVC 2007 paper, and the interesting result was that multiple segmentations must be used if we want to produce reasonable segments. Sadly, segmentation is generally not good enough by itself to produce object-corresponding regions.
(Here is an example of the Mean Shift algorithm where to get a single cow segment two adjacent regions had to be merged.)
The question of how to use segmentation algorithms for recognition is still open. If segmentation could tessellate an image into "good" regions in one shot, then the goal of recognition would be simply to label these regions and life would become simple. This is unfortunately far from reality. While blobs of homogeneous appearance often correspond to things like sky, grass, and road, many objects do not pop out as a single segment. I have proposed using a soup of such segments that come from different algorithms being run with different parameters (and even merging pairs and triplets of such segments!), but this produces a large number of regions, which makes the recognition task harder.
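To make the soup-of-segments idea concrete, here is a minimal sketch under stated assumptions; it is not the code from my paper. Each segmentation is assumed to be given as a 2D label map (a list of rows of integer segment ids), adjacency is 4-connectivity, and only pairs of adjacent segments are merged (triplets are left out for brevity). The function names `adjacent_pairs` and `segment_soup` are hypothetical.

```python
def adjacent_pairs(labels):
    """Return the pairs of segment ids that touch under 4-connectivity."""
    pairs = set()
    rows, cols = len(labels), len(labels[0])
    for r in range(rows):
        for c in range(cols):
            # Looking right and down visits every neighboring pixel pair once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and labels[r][c] != labels[rr][cc]:
                    pairs.add(frozenset((labels[r][c], labels[rr][cc])))
    return pairs

def segment_soup(labelings):
    """Pool segments and merged adjacent pairs from several segmentations.

    Each region is a frozenset of (row, col) pixels, so identical regions
    produced by different segmentations are stored only once.
    """
    soup = set()
    for labels in labelings:
        masks = {}
        for r, row in enumerate(labels):
            for c, seg in enumerate(row):
                masks.setdefault(seg, set()).add((r, c))
        for pixels in masks.values():
            soup.add(frozenset(pixels))
        for pair in adjacent_pairs(labels):
            a, b = tuple(pair)
            soup.add(frozenset(masks[a] | masks[b]))
    return soup

# Two toy segmentations of the same 2x3 image (assumption: they stand in
# for the output of different algorithms or parameter settings).
coarse = [[0, 0, 1],
          [0, 0, 1]]
fine = [[0, 1, 2],
        [0, 1, 2]]

soup = segment_soup([coarse, fine])
print(len(soup))
```

Even on this tiny image, two segmentations with two and three segments yield six distinct candidate regions, which hints at how quickly the soup grows on real images.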
Using a soup of segments, a small fraction of the regions might be of high quality; however, recognition now has to throw away thousands of misleading segments. Abhinav Gupta, a new addition to the CMU vision community, has pointed out that if we want to model context between segments (and for object-object relationships this means a quadratic dependence on the number of segments), using a large soup of segments is simply not tractable. Either the number of segments or the number of context interactions has to be reduced in this case, but non-quadratic object-object context models are an open question.
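The quadratic dependence is easy to quantify with a little arithmetic. The sketch below assumes one context term per unordered pair of segments, so n segments give n(n-1)/2 terms; the function name `pairwise_terms` is hypothetical, and a real model may prune or add terms.

```python
def pairwise_terms(num_segments):
    """Number of unordered segment pairs, i.e. n * (n - 1) / 2."""
    return num_segments * (num_segments - 1) // 2

for n in (10, 100, 1000):
    print(n, pairwise_terms(n))
```

Going from 10 segments to 1000 multiplies the number of segments by 100 but the number of pairwise terms by roughly 10,000 (from 45 to 499,500).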
In conclusion, the representation used by segmentation (that of free-form regions) is superior to sliding window approaches which utilize rectangular windows. However, off-the-shelf segmentation algorithms are still lacking with respect to their ability to generate such regions. Why should an algorithm that doesn't know anything about objects be able to segment out objects? I suspect that in the upcoming years we will see a flurry of learning-based segmenters that provide a blend of recognition and bottom-up grouping, and I envision such algorithms being used in a strictly non-feedforward way.