Deep Learning, Computer Vision, and the algorithms that are shaping the future of Artificial Intelligence.
Monday, October 19, 2009
Scene Prototype Models for Indoor Image Recognition
In the paper Recognizing Indoor Scenes, Quattoni and Torralba build a scene recognition system for categorizing indoor images. Instead of performing learning directly in descriptor space (such as the GIST over the entire image), the authors use a "distance-space" representation: an image is described by a vector of distances to a large number of scene prototypes. A scene prototype consists of a root feature (the global GIST) as well as features belonging to a small number of regions associated with the prototype. One example of such a prototype might be an office scene with a monitor region in the center of the image and a keyboard region below it -- however, the ROIs (which can be thought of as parts of the scene) are often more abstract and do not neatly correspond to a single object.
The learning problem (which is solved once per category) is then to find the internal parameters of each prototype as well as the per-class prototype distance weights that are used for classification. From a distance function learning point of view, it is rather interesting to see distances to many exemplars being used, as opposed to the distance to a single focal exemplar.
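To make the distance-space idea concrete, here is a minimal sketch, assuming each prototype is reduced to its root GIST vector and distances are Euclidean (the real prototypes also carry region features; all names here are my own illustrative inventions, not the authors' code):

```python
import numpy as np

def distance_space_vector(image_gist, prototypes):
    # Re-describe an image as its distances to K scene prototypes
    # (here each prototype is just its root GIST vector).
    return np.array([np.linalg.norm(image_gist - p) for p in prototypes])

def classify(image_gist, prototypes, class_weights):
    # Each category scores the distance vector with its own learned
    # per-prototype weights; the best-scoring category wins.
    d = distance_space_vector(image_gist, prototypes)
    scores = {c: w @ d for c, w in class_weights.items()}
    return max(scores, key=scores.get)
```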
Although the authors report results on the image categorization task, it is worthwhile to ask whether scene prototypes could be used for object localization. While it is easy to be the evil genius and devise an image unique enough that it doesn't conform to any notion of a prototype, I wouldn't be surprised if 80% of the images we encounter on the internet conform to a few hundred scene prototypes. Of course, the problem of learning such prototypes from data without prototype-labeling (which requires expert vision knowledge) is still open. Overall, I like the direction and ideas contained in this research paper and I'm looking forward to seeing how these ideas develop.
Tuesday, August 18, 2009
exciting stuff at BAVM2009 #1: joint regularization
In fact, when I was talking about my own CVPR08 work, Daphne Koller suggested that this sort of regularization might work for my task of learning distance functions. However, I am currently exploiting the independence I get from not doing any cross-problem regularization: the distance function learning problems are solved independently. While regularization might be desirable, it couples the problems, and it might be difficult to solve hundreds of thousands of such problems jointly. A toy sketch of the distinction follows.
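Here is a rough sketch of what coupling does to the optimization, assuming simple L2 regularizers and a made-up quadratic coupling term (my illustration, not the regularizer Daphne had in mind):

```python
import numpy as np

def independent_objectives(W, losses, lam):
    # One decoupled objective per exemplar: each w_i can be solved
    # for in isolation (and in parallel).
    return [losses[i](W[i]) + lam * W[i] @ W[i] for i in range(len(W))]

def jointly_regularized_objective(W, losses, lam, mu):
    # A single coupled objective: the pairwise term ties every w_i to
    # every w_j, so the problems can no longer be solved separately.
    data = sum(losses[i](W[i]) + lam * W[i] @ W[i] for i in range(len(W)))
    coupling = sum(np.sum((W[i] - W[j]) ** 2)
                   for i in range(len(W)) for j in range(i + 1, len(W)))
    return data + mu * coupling
```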
I will mention some other cool works in future posts.
Friday, June 19, 2009
A Shift of Focus: Relying on Prototypes versus Support Vectors

When solving an SVM optimization problem in the dual (given a kernel function), the answer is represented as a set of weights associated with each of the data-centered kernels. In the Figure above, an SVM is used to learn a decision boundary between the blue class (desks) and the red class (chairs). The sparsity of such solutions means that only a small set of examples is used to define the class decision boundary. All points on the wrong side of the decision boundary, as well as correctly classified points that fall within the margin, have non-zero weights. Many Machine Learning researchers get excited about the sparsity of such solutions because, in theory, we only need to remember a small number of kernels at test time. However, the decision boundary is defined with respect to the problematic examples (misclassified and barely classified ones) and not the most typical examples. The most typical (and easiest to recognize) examples are not even necessary to define the SVM decision boundary. Two data sets that share the same problematic examples but differ significantly in the "well-classified" examples can result in the exact same SVM decision boundary.
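This sparsity is easy to verify empirically. A toy sketch (using scikit-learn on synthetic 2-D data, not the desks-vs-chairs example from the figure): delete every non-support vector and retrain, and the solution is unchanged.

```python
import numpy as np
from sklearn.svm import SVC

# Two Gaussian blobs standing in for the two classes.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 2) - 2, rng.randn(200, 2) + 2])
y = np.hstack([np.zeros(200), np.ones(200)])

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print("support vectors: %d of %d examples" % (len(clf.support_), len(X)))

# Retrain on the support vectors alone: the easy, well-classified
# points carried zero dual weight, so the boundary does not move.
clf_sv = SVC(kernel="rbf", C=1.0).fit(X[clf.support_], y[clf.support_])
```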
My problem with such boundary-based approaches is that, by focusing only on the boundary between classes, useful information is lost. Consider what happens when two points are correctly classified (and fall well beyond the margin on their correct side): the distance-to-decision-boundary is not a good measure of class membership. By failing to capture the "density" of the data, the sparsity of such models can actually be a bad thing. As with discriminative methods, reasoning about the support vectors is useful for close-call classification decisions, but we lose fine-scale membership details (aka "density information") far from the decision surface.

In a single-prototype model (pictured above), a single prototype is used per class and distances-to-prototypes implicitly define the decision surface. The focus is on exactly the 'most confident' examples, namely the prototypes. Prototypes are created during training -- if we fit a Gaussian distribution to each class, the mean becomes the prototype. Notice that by focusing on Prototypes, we gain density information near the prototype at the cost of losing fine details near the decision boundary. Single-Prototype models generally perform worse on forced-choice classification tasks when compared to their SVM-based discriminative counterparts; however, there are important regimes where too much emphasis on the decision boundary is a bad thing.
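A minimal sketch of such a model, assuming the Gaussian-mean variant described above (the names are mine):

```python
import numpy as np

def fit_prototypes(X, y):
    # One prototype per class: the mean of that class's examples.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(x, prototypes):
    # Nearest-prototype rule: distances to prototypes implicitly
    # define the decision surface.
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

def typicality(x, prototypes, c):
    # Unlike distance-to-boundary, distance-to-prototype keeps grading
    # membership even far from the decision surface.
    return -np.linalg.norm(x - prototypes[c])
```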
In other words, Prototype Methods are best at what they were designed to do in categorization, namely capture Typicality Effects (see Rosch). It would be interesting to come up with more applications where handling Typicality Effects and grading membership becomes more important than making close-call classification decisions. I suspect that in many real-world information retrieval applications (where high precision is required and low recall is tolerated) going beyond boundary-based techniques is the right thing to do.
Tuesday, April 08, 2008
Recognition by Association via Learning Per-exemplar Distances
Abstract:
We pose the recognition problem as data association. In this setting, a novel object is explained solely in terms of a small set of exemplar objects to which it is visually similar. Inspired by the work of Frome et al., we learn separate distance functions for each exemplar; however, our distances are interpretable on an absolute scale and can be thresholded to detect the presence of an object. Our exemplars are represented as image regions and the learned distances capture the relative importance of shape, color, texture, and position features for that region. We use the distance functions to detect and segment objects in novel images by associating the bottom-up segments obtained from multiple image segmentations with the exemplar regions. We evaluate the detection and segmentation performance of our algorithm on real-world outdoor scenes from the LabelMe dataset and also show some promising qualitative image parsing results.
http://www.cs.cmu.edu/~tmalisie/projects/cvpr08/
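Since the distances are calibrated on an absolute scale, detection reduces to thresholding. A minimal sketch of that association step, with hypothetical names and threshold (not the actual CVPR08 code):

```python
import numpy as np

def exemplar_distance(w, feature_distances):
    # w: this exemplar's learned weights over elementary distances
    # (shape, color, texture, position); feature_distances: those
    # distances between a query segment and the exemplar's region.
    return w @ feature_distances

def associate(segment_feats, exemplars, dist_fn, threshold=1.0):
    # Keep every exemplar whose learned distance to this bottom-up
    # segment clears the absolute detection threshold.
    hits = []
    for name, w, exemplar_feats in exemplars:
        d = exemplar_distance(w, dist_fn(segment_feats, exemplar_feats))
        if d < threshold:
            hits.append((name, d))
    return sorted(hits, key=lambda h: h[1])
```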