I would like to introduce two papers about Constrained Parametric Min-Cuts from C. Sminchisescu's group. These papers are very relevant to my research direction (which lies at the intersection of segmentation and recognition). Like my own work, these papers are about segmentation for recognition's sake. The segmentation algorithm proposed in the first paper is a sort of "segment sliding" approach, where many binary graph-cut optimization problems are solved for different Grab-Cut style initializations. These segments are then scored using a learned scoring function -- think regression rather than classification. The authors show that the top-scoring segments are actually quite meaningful and correspond to object boundaries rather well. Finally, a tractable number of top hypotheses (still overlapping at this stage) are piped into a recognition engine.
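For intuition, here is a minimal sketch of the parametric min-cut idea -- my own toy reconstruction, not the authors' code. A seed pixel is clamped to the foreground, a uniform foreground bias is swept over a range of values (the "parametric" part), and each setting yields one figure-ground hypothesis via a single min-cut. The seed location, beta, and the lambda schedule below are arbitrary stand-ins, and a real implementation would use a fast max-flow solver over image grids rather than networkx:

```python
# Toy sketch of CPMC-style parametric min-cut; not the authors' implementation.
import itertools
import networkx as nx
import numpy as np

def parametric_mincut(img, seed, lambdas, beta=10.0, hard=1e6):
    """img: 2D float array in [0, 1]; seed: (row, col) assumed foreground."""
    h, w = img.shape
    hypotheses = []
    for lam in lambdas:
        g = nx.DiGraph()
        for r, c in itertools.product(range(h), range(w)):
            # Unary terms: the seed is clamped to the source (foreground);
            # every pixel pays a uniform bias lam for joining the foreground.
            g.add_edge('s', (r, c), capacity=hard if (r, c) == seed else 0.0)
            g.add_edge((r, c), 't', capacity=lam)
            # Pairwise terms: contrast-sensitive smoothness, 4-connectivity.
            for nr, nc in ((r, c + 1), (r + 1, c)):
                if nr < h and nc < w:
                    wgt = float(np.exp(-beta * (img[r, c] - img[nr, nc]) ** 2))
                    g.add_edge((r, c), (nr, nc), capacity=wgt)
                    g.add_edge((nr, nc), (r, c), capacity=wgt)
        _, (source_side, _) = nx.minimum_cut(g, 's', 't')
        mask = np.zeros((h, w), dtype=bool)
        for node in source_side - {'s'}:
            mask[node] = True
        hypotheses.append(mask)   # one figure-ground hypothesis per lambda
    return hypotheses

masks = parametric_mincut(np.random.rand(8, 8), seed=(4, 4),
                          lambdas=[0.1, 0.3, 0.6, 1.0])
```

Sweeping the bias rather than fixing it is what makes the approach category-free: the pool of nested hypotheses is generated bottom-up, and only afterwards does the learned scoring function decide which ones look like objects.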
The idea that features derived from segments are better for recognition than features derived from the spatial support of a sliding rectangle resonates throughout my own papers. Regarding these CVPR 2010 papers, I like the idea of learning a category-free "segmentation function," and the multiple-segmentation flavor of the algorithm is very appealing. If I remember correctly, the idea of learning a segmentation function comes to us from X. Ren, and the idea of using multiple segmentations comes from D. Hoiem. These papers are a cool new idea utilizing both insights.
J. Carreira and C. Sminchisescu. Constrained Parametric Min-Cuts for Automatic Object Segmentation. In CVPR 2010.
F. Li, J. Carreira, and C. Sminchisescu. Object Recognition as Ranking Holistic Figure-Ground Hypotheses. In CVPR 2010.
-------
Spotlights for these papers are during these tracks at CVPR2010:
Object Recognition III: Similar Shapes
Segmentation and Grouping II: Semantic Segmentation
Sunday, June 13, 2010
Sven Dickinson at POCV 2010, ACVHL Tomorrow
This morning, Sven Dickinson gave a talk to start the POCV 2010 Workshop at CVPR2010. For those of you who might not know, POCV stands for Perceptual Organization in Computer Vision. While segmentation can be thought of as a perceptual grouping process, contiguous regions don't have to be the end product of a meaningful perceptual grouping process. There are many popular and useful algorithms which group non-accidental contours yet fall short of a full-blown image segmentation.
The title of Dickinson's talk was "The Role of Intermediate Shape Priors in Perceptual Grouping and Image Abstraction." In the beginning of his talk, Sven pointed out how perceptual organization was in its prime in the mid-90s and declined in the 2000s due to the popularity of machine learning and the "detection" task. He believes that good perceptual grouping is what is going to make vision scale -- that is, without first squeezing all that we can out of the bottom level, we are doomed to fail.
Dickinson showed some nice results from his most recent research efforts where objects are broken down into generic "parts" -- this reminded me of Biederman's geons, although Sven's fitting is done in the 2D image plane. Sven emphasized that successful shape primitives must be category-independent if we are to have scalable recognition of thousands of visual concepts in images. This is quite different from the mainstream per-category object detection task which has been popularized by contests such as the PASCAL VOC.
While I personally believe that there is a good place for perceptual organization in vision, I wouldn't view it as the Holy Grail. It is perhaps the Holy Bridge we must inevitably cross on the way to finding the Holy Grail. I believe that for full-grown fully-functional members of society, our ability to effortlessly cope with the world is chiefly due to its simplicity and repeatability, and not due to some amazing internal perceptual organization algorithm. Perhaps it is when we were children -- viewing the world through a psychedelic fog of innocence -- that perceptual grouping helped us cut up the world into meaningful entities.
A common theme in Sven's talk was the idea of Learning to Group in a category-independent way. This means that all of the successes of Machine Learning aren't thrown out the door, and this appears to be a quite different way of grouping than what was done in the 1970s.
Tomorrow I will be at ACVHL Workshop "Advancing Computer Vision with Humans in the Loop". I haven't personally "turked" yet, but I feel I will be jumping on the bandwagon soon. Anyways, the keynote speakers should make for an awesome workshop. They do not need introductions: David Forsyth, Aude Oliva, Fei-Fei Li, Antonio Torralba, and Serge Belongie -- all influential visionaries.
Monday, April 05, 2010
Exciting Computer Vision papers from Kristen Grauman's UT-Austin Group
Back in 2005, I remember meeting Kristen Grauman at MIT's accepted PhD student open house. Back then she was a PhD student under Trevor Darrell (and is known for her work on the Pyramid Match Kernel), but now she has her own vision group at UT-Austin. She is the advisor behind many cool vision projects there, and here are a few segmentation/categorization related papers from the upcoming CVPR 2010 conference. I look forward to checking out these papers because they are relevant to my own research interests. NOTE: some of the paper links are still not up -- I just used the links from Kristen's webpage.
Y. J. Lee and K. Grauman. Object-Graphs for Context-Aware Category Discovery. In CVPR 2010.
Y. J. Lee and K. Grauman. Collect-Cut: Segmentation with Top-Down Cues Discovered in Multi-Object Images. In CVPR 2010.
Tuesday, October 13, 2009
What is segmentation-driven object recognition?
In this post, I want to discuss what the term "segmentation-driven object recognition" means to me. While segmentation-only and recognition-only research papers are ubiquitous in vision conferences (such as CVPR, ICCV, and ECCV), a new research direction which uses segmentation for recognition has emerged. Many researchers pushing in this direction are direct descendants of the great J. Malik, such as Belongie, Efros, Mori, and many others. The best example of segmentation-driven recognition can be found in Rabinovich's Objects in Context paper. The basic idea in this paper is to compute multiple stable segmentations of an input image using NCuts, and then use a dense probabilistic graphical model over segments (combining local terms and segment-segment context) to recognize objects inside those regions.
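To give a flavor of what such a model over segments looks like, here is an illustrative toy version -- not Rabinovich's actual formulation. Each segment gets local appearance scores per class, and adjacent segments share a co-occurrence bonus; with a handful of segments we can afford exhaustive MAP search, whereas the paper uses proper CRF inference. All scores and the adjacency structure below are made up:

```python
# Toy segment-level context model; all numbers are hypothetical.
import itertools
import numpy as np

classes = ['sky', 'grass', 'cow']
# local[i, k]: appearance score of class k for segment i (made up).
local = np.array([[2.0, 0.1, 0.2],    # segment 0 looks like sky
                  [0.1, 1.5, 0.8],    # segment 1 looks like grass
                  [0.2, 0.6, 1.2]])   # segment 2 looks vaguely like cow
# context[k, l]: compatibility of classes k and l on adjacent segments.
context = np.array([[0.0, 0.5, 0.1],
                    [0.5, 0.0, 0.9],  # grass-cow co-occur often
                    [0.1, 0.9, 0.0]])
edges = [(0, 1), (1, 2)]              # which segments are adjacent

# Exhaustive MAP: maximize sum of local terms plus pairwise context terms.
best = max(itertools.product(range(len(classes)), repeat=local.shape[0]),
           key=lambda lab: sum(local[i, lab[i]] for i in range(len(lab)))
                         + sum(context[lab[i], lab[j]] for i, j in edges))
print([classes[k] for k in best])     # e.g. ['sky', 'grass', 'cow']
```

Note how the weak local evidence for "cow" on segment 2 gets rescued by the grass-cow context term -- this is exactly the kind of win that motivates doing recognition over segments instead of in isolation.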
Segmentation-only research focuses on the actual image segmentation algorithms -- where the output of a segmentation algorithm is a partition of a 2D image into contiguous regions. Algorithms such as mean-shift, normalized cuts, and hundreds of probabilistic graphical models can be used to produce such segmentations. The Berkeley group (in an attempt to salvage "mid-level" vision) has been working diligently on boundary detection and image segmentation for over a decade.
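As a concrete example of this segmentation-only flavor, here is a bare-bones mean-shift segmentation in the spirit of Comaniciu and Meer: cluster each pixel's joint (color, position) feature with scikit-learn's MeanShift. The bandwidth and spatial scaling below are arbitrary, the input image is a random stand-in, and this is far slower than dedicated implementations such as EDISON:

```python
# Minimal mean-shift segmentation sketch; toy-sized input only.
import numpy as np
from sklearn.cluster import MeanShift

img = np.random.rand(32, 32, 3)                    # stand-in RGB image
h, w, _ = img.shape
ys, xs = np.mgrid[0:h, 0:w]
# Stack color and (scaled) spatial coordinates into a 5-D feature per pixel.
feats = np.column_stack([img.reshape(-1, 3),
                         0.02 * ys.ravel(), 0.02 * xs.ravel()])
labels = MeanShift(bandwidth=0.3, bin_seeding=True).fit_predict(feats)
segments = labels.reshape(h, w)                    # region id per pixel
```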
Recognition-only research generally focuses on new learning techniques or building systems to perform well on detection/classification benchmarks. The sliding window approach coupled with bag-of-words models has dominated vision and is the unofficial method of choice.
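For contrast, here is what that recognition-only workhorse boils down to: slide a fixed-size rectangle across the image at some stride, build a bag-of-words histogram of quantized local features inside it, and score it with a linear classifier. The per-pixel codeword map and the classifier weights below are random stand-ins for a trained model:

```python
# Bare-bones sliding-window + bag-of-words scoring loop (stand-in model).
import numpy as np

codewords = np.random.randint(0, 50, size=(240, 320))  # per-pixel visual word ids
w_clf = np.random.randn(50)                            # linear weights (stand-in)
win_h, win_w, stride = 64, 64, 16

detections = []
for top in range(0, codewords.shape[0] - win_h + 1, stride):
    for left in range(0, codewords.shape[1] - win_w + 1, stride):
        patch = codewords[top:top + win_h, left:left + win_w]
        hist = np.bincount(patch.ravel(), minlength=50).astype(float)
        hist /= hist.sum()                             # normalized BoW histogram
        score = w_clf @ hist                           # linear classifier score
        detections.append((score, top, left))
best = max(detections)                                 # highest-scoring window
```

Notice that the spatial support is baked in as a rectangle from the start -- the classifier never gets a chance to reason about free-form regions, which is precisely the limitation discussed next.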
It is easy to relax the bag-of-words model, so let's focus on rectangles for a second. If we do not use segmentation, the world of objects will have to conform to sliding rectangles and image parsing will inevitably look like this:
(Taken from Bryan Russell's Object Recognition by Scene Alignment paper).
It has been argued that segmentation is required to move beyond the world of rectangular windows if we are to successfully break up images into their constituent objects. While some objects can be neatly approximated by a rectangle in the 2D image plane, to explain away an arbitrary image, free-form regions must be used. I have argued this point extensively in my BMVC 2007 paper, and the interesting result was that multiple segmentations must be used if we want to produce reasonable segments. Sadly, segmentation is generally not good enough by itself to produce object-corresponding regions.
(Here is an example of the Mean Shift algorithm where to get a single cow segment two adjacent regions had to be merged.)
The question of how to use segmentation algorithms for recognition is still open. If segmentation could tessellate an image into "good" regions in one shot, then the goal of recognition would simply be to label these regions, and life would be simple. This is unfortunately far from reality. While blobs of homogeneous appearance often correspond to things like sky, grass, and road, many objects do not pop out as a single segment. I have proposed using a soup of such segments that come from different algorithms run with different parameters (and even merging pairs and triplets of such segments!), but this produces a large number of regions, which in turn makes the recognition task harder.
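Here is roughly what I mean by building such a soup -- a simplified sketch of the recipe, not the exact procedure from my paper: start from the regions of one (or several) segmentations and also add the union of every pair of adjacent regions, since a single bottom-up region often covers only part of an object (e.g. half a cow):

```python
# Simplified "soup of segments" sketch: singleton regions plus merged
# adjacent pairs, all kept as overlapping boolean masks.
import numpy as np

def segment_soup(label_map):
    soup = [label_map == k for k in np.unique(label_map)]
    # Find adjacent pairs: labels that differ across a horizontal or
    # vertical pixel edge in the label map.
    pairs = set()
    lm = label_map
    for a, b in zip(lm[:, :-1].ravel(), lm[:, 1:].ravel()):
        if a != b:
            pairs.add((min(a, b), max(a, b)))
    for a, b in zip(lm[:-1, :].ravel(), lm[1:, :].ravel()):
        if a != b:
            pairs.add((min(a, b), max(a, b)))
    for a, b in pairs:
        soup.append((lm == a) | (lm == b))      # merged pair of regions
    return soup                                  # list of boolean masks

labels = np.array([[0, 0, 1], [0, 2, 1], [2, 2, 1]])
print(len(segment_soup(labels)))                # 3 singletons + 3 merged pairs
```

Running this over several segmentations (and extending it to triplets) blows the region count up quickly, which is exactly the tractability problem discussed below.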
Using a soup of segments, a small fraction of the regions might be of high quality; however, recognition now has to throw away thousands of misleading segments. Abhinav Gupta, a new addition to the CMU vision community, has pointed out that if we want to model context between segments (and for object-object relationships this means a quadratic dependence on the number of segments -- with 1,000 segments there are already roughly 500,000 pairwise interactions), using a large soup of segments is simply not tractable. Either the number of segments or the number of context interactions has to be reduced in this case, but non-quadratic object-object context models are an open question.
In conclusion, the representation used by segmentation (that of free-form regions) is superior to sliding-window approaches which utilize rectangular windows. However, off-the-shelf segmentation algorithms are still lacking in their ability to generate such regions. Why should an algorithm that doesn't know anything about objects be able to segment out objects? I suspect that in the upcoming years we will see a flurry of learning-based segmenters that provide a blend of recognition and bottom-up grouping, and I envision such algorithms being used in a strictly non-feedforward way.
Sunday, August 10, 2008
What is segmentation? What is image segmentation?
According to Merriam-Webster, segmentation is "the process of dividing into segments" and a segment is "a separate piece of something; a bit, or a fragment." This is a rather broad definition which suggests that segmentation is nothing mystical -- it is just taking a whole and partitioning it into pieces. One can segment sentences, time periods, tasks, inhabitants of a country, and digital images.
Segmentation is a term that often pops up in technical fields such as Computer Vision. I have attempted to write a short article on Knol about Image Segmentation and how it pertains to Computer Vision. Deciding to abstain from discussing specific algorithms -- which might be of interest to graduate students but not the population as a whole -- I instead target the high-level question, "Why segment images?" The answer, in my view, is that image segmentation (and any other image processing task) should be performed solely to assist object recognition and image understanding.