Wednesday, April 18, 2012

One Part Basis to Rule them All: Steerable Part Models

Last week, some of us vision hackers at MIT started an Object Recognition Reading Group.  The group is currently in stealth-mode, but our goal is to analyze, criticize, and re-synthesize ideas from the object detection/recognition community.  To inaugurate the group, I covered Hamed Pirsiavash's Steerable Part Models paper from the upcoming CVPR 2012 conference.  As background reading, I had to go over the mathematical basics of learning with tensors (i.e., multidimensional arrays) which were outlined in their earlier NIPS 2009 paper, Bilinear Classifiers for Visual Recognition.  After reading up on their work, I have a better grasp of what the trace operator actually does.  It is nothing more than a Hermitian inner product defined between the space of linear operators from C^N to C^M (see post here for geometric interpretations of the trace).

Hamed Pirsiavash, Deva Ramanan, "Steerable part models", CVPR 2012

"Our representation can be seen as an approach to sharing parts." 
-- H. Pirisiavash and D. Ramanan

The idea behind this paper is relatively simple -- instead of learning category-specific part-models, learn a part-basis from which all category-specific part models come from.  Consider the different parts learned from a deformable part model (see Felzenszwalb's DPM page for more info about DPMs) and their depiction below.  If you take a close look you see that the parts are quite general, and it makes sense to assume that there is a finite basis from which these parts come from.

Parts from a Part-model

The model learns a steerable basis by factoring the matrix of all part models into the product of two low rank matrices, and because the basis is shared across categories, this performs both dimensionality reduction (like to help prevent over-fitting as well as speed up the final detectors) and sharing (likely to boost performance).

The learned steerable basis

While the objective function is not convex, it can be tackled via a simple alternating optimization algorithm where the resulting sub-objectives are convex and can be optimized using off-the-shelf Linear SVM solvers.  They call this property bi-convexity, and it doesn't guarantee finding the global optimum, just makes using standard tools easy.

While the results on PASCAL VOC2007, do not show an improvement in performance (VOC2007 is not a very good dataset for sharing as there are only a few category combinations which should in theory benefit significantly from sharing (e.g., bicycle and motorbike)), they show a significant computational speed up.  Below is a picture of the part-based car model from Felzenszwalb et al, as well as the one from their steerable basis approach.  Note that the HOG visualizations look very similar.

In conclusion, this is one paper worthy of checking out if you are serious about object recognition research.  The simplicity of the approach is a strong point, and if you are a HOG-hacker (like many of us these days) then you will be able to understand the paper without a problem.

1 comment:

  1. Andrej Karpathy3:55 AM

    I see strong connections with this general line of work (taking HOG cells and learning dictionaries) and deep learning. A single HOG cell is exactly equivalent to a one-layer conv net. Specifically, with simple cells in layer (S1) with 9 linear (rectified) filters, each detecting an edge of different orientation, and a complex layer cell (C1) that does average pooling across the simple cell responses. A deep network would go on to learn the second simple cells layer (S2) on top of these C1 cells by learning a dictionary again in a slightly extended receptive field. Here, that is equivalent to learning a dictionary for a grid of HOG Cells. In other words, there is the same theme of building alternating layers of AND operations and OR operations -- selective operations and invariant ones -- simple cells or complex cells, etc. It's funny how these two communities have their own dictionaries, but there is often a 1-1 translation and they are computing basically equivalent operations on images.

    In this particular paper though, it seems that the dictionary is not obtained by simply running a version of Vector Quantization on the hog cell grids, so I'm not quite sure about the mapping there. Will have to read up a bit more on this.

    Though, btw, in deep learning people sometimes tend to prefer using MAX operation as opposed to AVG for complex cells, as they sometimes find it works better. I wonder -- what kind of results do you get if, in the HOG code, you don't do a histogram of edges, but instead set each bin to be max of edge response in the particular direction?