Friday, June 19, 2009

A Shift of Focus: Relying on Prototypes versus Support Vectors

The goal of today's blog post is to outline an important difference between traditional categorization models in Psychology such as Prototype Models, and Support Vector Machine (SVM) based models.

When solving an SVM optimization problem in the dual (given a kernel function), the answer is represented as a set of weights, one associated with each of the data-centered kernels. In the Figure above, an SVM is used to learn a decision boundary between the blue class (desks) and the red class (chairs). The sparsity of such solutions means that only a small subset of the examples is used to define the class decision boundary. All misclassified points, as well as correctly classified points that fall within the margin, have non-zero weights. Many Machine Learning researchers get excited about the sparsity of such solutions because, in theory, we only need to remember a small number of kernels at test time. However, the decision boundary is defined with respect to the problematic examples (the misclassified and barely classified ones) and not the most typical examples. The most typical (and easiest to recognize) examples are not even necessary to define the SVM decision boundary. Two data sets that share the same problematic examples, but differ significantly in their "well-classified" examples, might result in the exact same SVM decision boundary.
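To make the sparsity point concrete, here is a minimal sketch using scikit-learn on a toy two-blob dataset (the data, names, and parameters are illustrative, not taken from the figure):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# two well-separated Gaussian blobs standing in for "desks" and "chairs"
X = np.vstack([rng.randn(50, 2) + [3, 3], rng.randn(50, 2) - [3, 3]])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# only the margin-defining (and any misclassified) points carry non-zero
# dual weights; the easy, typical examples play no role in the boundary
print(len(clf.support_), "of", len(X), "training examples are support vectors")
```

Deleting any of the non-support training points and refitting would leave the boundary unchanged, which is exactly the point made above.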

My problem with such boundary-based approaches is that by focusing only on the boundary between classes, they discard useful information. Consider what happens when two points are correctly classified and fall well beyond the margin on their correct side: the distance-to-decision-boundary is no longer a good measure of class membership. By failing to capture the "density" of the data, the sparsity of such models can actually be a bad thing. As with other discriminative methods, reasoning about the support vectors is useful for close-call classification decisions, but we lose fine-scale membership details (aka "density information") far from the decision surface.

In a single-prototype model (pictured above), a single prototype is used per class and distances-to-prototypes implicitly define the decision surface. The focus is on exactly the 'most confident' examples, which are the prototypes. Prototypes are created during training -- if we fit a Gaussian distribution to each class, the mean becomes the prototype. Notice that by focusing on Prototypes, we gain density information near the prototype at the cost of losing fine-details near the decision boundary. Single-Prototype models generally perform worse on forced-choice classification tasks when compared to their SVM-based discriminative counterparts; however, there are important regimes where too much emphasis on the decision boundary is a bad thing.
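By contrast, a single-prototype model fits in a few lines. Here is a hedged sketch in which the prototype is simply the class mean (the center of a Gaussian fit with identity covariance); the function names and toy data are mine, not from the post:

```python
import numpy as np

def fit_prototypes(X, y):
    """One prototype per class: the class mean (the fitted Gaussian's center)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(prototypes, x):
    """Assign x to the class of the nearest prototype (Euclidean distance)."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

X = np.array([[3.0, 3.0], [4.0, 3.0], [-3.0, -3.0], [-4.0, -3.0]])
y = np.array([1, 1, 0, 0])
protos = fit_prototypes(X, y)
print(predict(protos, np.array([2.5, 2.0])))  # → 1
```

With Euclidean distance and one prototype per class, the implied decision surface is just the perpendicular bisector between the two class means; the distance to each prototype is what carries the density information.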

In other words, Prototype Methods are best at what they were designed to do in categorization, namely capturing Typicality Effects (see Rosch). It would be interesting to come up with more applications where handling Typicality Effects and graded membership becomes more important than making close-call classification decisions. I suspect that in many real-world information retrieval applications (where high precision is required and low recall is tolerated), going beyond boundary-based techniques is the right thing to do.
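One way to operationalize graded membership is to score items by their distance to the prototype and only return the most typical ones. A minimal sketch — the Gaussian falloff and the bandwidth `sigma` are my assumptions, not something the post specifies:

```python
import numpy as np

def typicality(x, prototype, sigma=1.0):
    """Graded membership score in (0, 1]: the closer x is to the prototype,
    the more typical it is. Gaussian falloff with an assumed bandwidth sigma."""
    return float(np.exp(-np.linalg.norm(x - prototype) ** 2 / (2 * sigma ** 2)))

prototype = np.array([0.0, 0.0])
# a high-precision / low-recall retrieval system would only return items
# whose typicality clears a strict threshold, ignoring boundary cases
scores = [typicality(np.array(p), prototype)
          for p in [[0.1, 0.0], [1.0, 0.0], [3.0, 0.0]]]
print([round(s, 3) for s in scores])
```

Note that an SVM's signed distance to the hyperplane cannot play this role: two points far beyond the margin get ever-larger scores regardless of how atypical they are for their class.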


  1. Gaussian processes (and their cousins relevance vector machines) are just like SVMs in every way, except that they do exactly what you want: classes are described by their centers, not their outliers.

  2. Thanks for the quick comment! I will definitely look at GPs.

  3. A related observation (that I am witnessing) is a gradual shift towards query-based models in ML, e.g., LWR, KNN-SVM, etc.

  4. Interesting observation Santosh. It seems to me that if you have orders of magnitude more training data than testing data, then it doesn't make sense to solve a gigantic offline optimization problem. A local approach can home in on the parts of the space relevant to a prediction and come up with a decision on the fly.

    Santosh, do you think there are other reasons why query-based models are gaining popularity?
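For what it's worth, the query-based flavor mentioned in comment 3 can be sketched with the simplest lazy learner, k-nearest neighbours: nothing is optimized offline, and all the work happens at query time (the toy data and choice of k are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Lazy, query-time prediction: no offline optimization, just a majority
    vote among the k training points nearest to this particular query."""
    d = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(d)[:k]
    vals, counts = np.unique(y_train[nearest], return_counts=True)
    return vals[int(np.argmax(counts))]

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.1, 0.1]), k=3))  # → 0
```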

  5. The main problem with working with one prototype per class is that there may be more than one visual prototype for a single semantic class. The multiple kernel approach of SVMs solves this problem naturally. Do you have any suggestions about how to discern how many visual prototypes a semantic class requires?

  6. Xavier, I agree with you that representing a visual category with one prototype has severe limitations. The goal of this post was to discuss the difference between something like a Naive Bayes classifier and a linear SVM classifier -- this is basically one version of the generative vs. discriminative debate.

    Regarding the number of visual prototypes, I think the answer will be very much dependent on the *representation* of instances. For example, it is well known that for car detection via HOG (Histogram of Oriented Gradients) features, it makes sense to cut up the instances based on aspect (left view, frontal view, etc.). Having a different prototype for each aspect makes sense in this case, because a HOG template cannot make fronts of cars and sides of cars look similar.

    The problem is that when you start using more abstract features, such as histograms of local features, it is not clear how many prototypes should be used.

    An alternative approach, which I personally support, is to use an exemplar-based approach -- where all of the training instances are retained. Of course, additional tricks for removing redundant exemplars could be beneficial to reduce the complexity.

    Different visual categories have different intraclass variations. I think having around 5 prototypes for the car category and a single prototype for the road category makes sense -- but I have no clue how many prototypes/exemplars we would need to recognize cats in arbitrary poses!
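A hedged sketch of the "removing redundant exemplars" trick mentioned above, in the spirit of condensed nearest-neighbour editing (the greedy rule and toy data are illustrative, not a method from the post):

```python
import numpy as np

def prune_redundant(X, y):
    """Greedy exemplar pruning: keep an exemplar only if the exemplars
    retained so far would misclassify it under 1-nearest-neighbour."""
    keep = [0]
    for i in range(1, len(X)):
        d = np.linalg.norm(X[keep] - X[i], axis=1)
        nearest = keep[int(np.argmin(d))]
        if y[nearest] != y[i]:   # retained set gets this exemplar wrong,
            keep.append(i)       # so it is not redundant -- keep it
    return keep

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
y = np.array([0, 0, 0, 1, 1])
print(prune_redundant(X, y))  # → [0, 3]
```

Categories with large intraclass variation (the cats-in-arbitrary-poses case) would naturally retain many exemplars under a rule like this, while a compact category like "road" would collapse to very few.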