Tuesday, December 13, 2011

learning to "borrow" examples for object detection. Lim et al, NIPS 2011

Let's say you want to train a cat detector...  If you're anything like me, then you probably have a few labeled cats (~100), as well as a source of non-cat images (~1000).  So what do you do when you can't get any more labeled cats?  (Maybe Amazon's Mechanical Turk service was shut down by the feds, you've got a paper deadline in 48 hours, and money can't get you out of this dilemma.)

1) Realize that there are some labeled dogs/cows/sheep in your dataset!
2) Transform some of the dogs/cows/sheep in your dataset to make them look more like cats. Maybe some dogs are already sufficiently similar to cats! (see cheezburger.com image below)
3) Use a subset of those transformed dogs/cows/sheep examples as additional positives in your cat detector!

Some dogs just look like cats! (and vice-versa)

Using my own internal language, I view this phenomenon as "exemplar theft."  But not the kind of theft which sends you to prison, 'tis the kind of theft which gives you best-paper prizes at your local conference.

Note that this was the answer provided by the vision hackers at MIT in their most recent paper, "Transfer Learning by Borrowing Examples for Multiclass Object Detection," which was just presented at NIPS 2011, this year's big machine learning conference. See the illustration from the paper below, which depicts this type of example borrowing and sharing for some objects in the SUN09 dataset.

The paper empirically demonstrates that instead of doing transfer learning (also known as multi-task learning) the typical way (regularizing weight vectors towards each other), it is beneficial to simply borrow a subset of (transformed) examples from a related class.  Of course the problem is that we do not know a priori which categories to borrow from, nor which instances from those categories will give us a gain in object detection performance.  The goal of the algorithm is to learn which categories to borrow from, and which examples to borrow.  Not all dogs will help the cat detector.
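
To make the "learn which examples to borrow" idea concrete, here is a toy sketch in Python. To be clear, this is my own simplified heuristic and not the objective from the paper (see the paper for the real formulation): it just alternates between training a weighted logistic-regression "cat detector" and keeping only those candidate dogs/cows that the current detector already scores as cat-like. The function names, the loss threshold, and the 2D features (plus a constant bias feature) are all made up for illustration.

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def log_loss(w, x, y):
    # Logistic loss of one example with label y in {-1, +1}.
    return math.log1p(math.exp(-y * dot(w, x)))

def train(examples, dim, epochs=200, lr=0.1):
    # examples: list of (features, label, weight); weight 0.0 means "ignored".
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y, s in examples:
            # Gradient of the weighted logistic loss for this example.
            g = -y * s / (1.0 + math.exp(y * dot(w, x)))
            for j in range(dim):
                grad[j] += g * x[j]
        w = [wj - lr * gj / len(examples) for wj, gj in zip(w, grad)]
    return w

def borrow_examples(cats, non_cats, candidates, dim, rounds=3, max_loss=0.5):
    # Hypothetical heuristic: start by borrowing nothing, then repeatedly
    # retrain and keep only the candidates the detector treats as cat-like.
    keep = [0.0] * len(candidates)
    w = [0.0] * dim
    for _ in range(rounds):
        data = ([(x, +1, 1.0) for x in cats]
                + [(x, -1, 1.0) for x in non_cats]
                + [(x, +1, s) for x, s in zip(candidates, keep)])
        w = train(data, dim)
        keep = [1.0 if log_loss(w, x, +1) < max_loss else 0.0
                for x in candidates]
    return w, keep

# Toy data: the last feature is a constant bias term.
cats = [(2.0, 0.1, 1.0), (1.8, -0.2, 1.0), (2.2, 0.0, 1.0)]
non_cats = [(-2.0, 0.1, 1.0), (-1.9, -0.1, 1.0), (-2.1, 0.0, 1.0)]
candidates = [(1.5, 0.2, 1.0),   # a cat-like dog
              (-1.5, 0.1, 1.0)]  # a cow that looks nothing like a cat
w, keep = borrow_examples(cats, non_cats, candidates, dim=3)
print(keep)
```

On this toy data the cat-like dog gets borrowed and the cow does not, which is the behaviour we want: not all dogs (or cows) will help the cat detector.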

Here are some examples of popular object categories, the categories from which examples are borrowed, and the categories from which examples are shared once we allow transformations to happen.  Notice the improvement in AP (the higher the average precision the better) when you allow sharing.
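
In case AP is new to you, here is a minimal sketch of how average precision can be computed from a ranked list of detections. This is the plain, non-interpolated version; benchmark protocols differ in the details (PASCAL VOC 2007, for instance, uses an 11-point interpolated variant), so treat the function below as an illustration rather than the exact evaluation code behind the numbers in the paper. The function name and the toy scores are mine.

```python
def average_precision(scores, labels):
    # scores: detector confidence per detection.
    # labels: 1 if the detection is a true positive, 0 otherwise.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank  # precision at this recall point
    return total / n_pos

# Two true positives, ranked 1st and 3rd: AP = (1/1 + 2/3) / 2
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

A detector that ranks every true positive above every false positive gets an AP of 1.0; every false positive that sneaks in ahead of a true positive pulls the number down.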

They also looked at what happens if you want to improve a single-category badass detector on one particular dataset, such as the PASCAL VOC.  Note that these days just about everybody is using the one-and-only "badass detector" and trying to beat it at its own game.   These are the different ways you'll hear people talk about the Latent-SVM-based Deformable Part Model baseline: "badass detector" = "state-of-the-art detector" = "Felzenszwalb et al. detector" = "Pedro's detector" = "Deva's detector" = "Pedro/Deva detector" = "LDPM detector" = "DPM detector".

Even if you only care about your favourite dataset, such as PASCAL VOC, you're probably willing to use additional positive data points from another dataset.  In their NIPS paper, the MIT hackers show that simply concatenating datasets is inferior to their clever example borrowing algorithm (mathematical details are found in the paper, but feel free to ask me detailed questions in the comments).  In the figure below, the top row shows cars from one dataset (SUN09), the middle row shows PASCAL VOC 2007 cars, and the bottom row shows which example the SUN09-car detector wants to borrow from PASCAL VOC.

Here is the cross-dataset generalization performance on the SUN09/PASCAL duo.  These results were inspired by the dataset bias work of Torralba and Efros.

In case you're interested, here is the full citation for this excellent NIPS2011 paper:

Joseph J. Lim, Ruslan Salakhutdinov, and Antonio Torralba. "Transfer Learning by Borrowing Examples for Multiclass Object Detection," in NIPS 2011. [pdf]

To get a better understanding of Lim et al.'s paper, it is worthwhile going back in time to CVPR 2011 and taking a quick look at the following paper, also from MIT:

Ruslan Salakhutdinov, Antonio Torralba, Josh Tenenbaum. "Learning to Share Visual Appearance for Multiclass Object Detection," in CVPR 2011. [pdf]

Of course, these authors need no introduction (they are all professors at big-time institutions). Ruslan just recently became a Professor and is now back on home turf (where he got his PhD) in Toronto, where he is likely to become the next Hinton.  In my opinion, this "Learning to share" paper was one of the best papers of CVPR 2011.  In this paper they introduced the idea of sharing across rigid classifier templates, and more importantly learning a tree to organize hundreds of object categories.  The tree defines how the sharing is supposed to happen.  The root node is global and shared across all categories, the mid-level nodes can be interpreted as super-categories (e.g., animal, vehicle), and the leaves are the actual object categories (e.g., dog, chair, person, truck).
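
One way to picture how such a tree shares information is to build each category's template by adding up the weight vectors along the path from its leaf to the root. The sketch below assumes exactly that additive reading; the tree, the node names, and the two-dimensional weights are toy values of mine, and the paper itself is the place to look for the actual model.

```python
def category_weights(category, parent, node_weights):
    # Sum the weight vectors on the path leaf -> super-category -> root.
    total = [0.0] * len(node_weights[category])
    node = category
    while node is not None:
        total = [t + v for t, v in zip(total, node_weights[node])]
        node = parent.get(node)
    return total

# Hypothetical two-level tree: root -> super-categories -> categories.
parent = {"root": None,
          "vehicle": "root", "animal": "root",
          "truck": "vehicle", "van": "vehicle", "dog": "animal"}
node_weights = {"root": [1.0, 0.0],
                "vehicle": [0.0, 2.0], "animal": [0.0, -2.0],
                "truck": [0.5, 0.0], "van": [-0.5, 0.0], "dog": [0.0, 0.5]}

print(category_weights("truck", parent, node_weights))  # root + vehicle + truck
print(category_weights("van", parent, node_weights))    # root + vehicle + van
```

Under this reading, truck and van only differ in their leaf-level term, so a category with very few training examples can lean on whatever its super-category has already learned.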

The coolest thing about the paper is that they use a CRP (Chinese restaurant process) to learn a tree without having to specify the number of super-categories!
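
If you have never played with a CRP, here is a tiny sketch of sampling from the textbook version: each new customer (think: object category) joins an existing table (think: super-category) with probability proportional to how many customers already sit there, or opens a new table with probability proportional to a concentration parameter alpha. This only shows the prior; it is not the inference procedure the paper uses to learn the tree, and the function name and parameters are mine.

```python
import random

def sample_crp(n_customers, alpha, seed=0):
    # alpha > 0 controls how readily new tables are opened.
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    assignments = []  # table index chosen by each customer
    counts = []       # number of customers at each table
    for _ in range(n_customers):
        # Existing tables weighted by occupancy, a new table weighted by alpha.
        weights = counts + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):
            counts.append(1)      # open a new table
        else:
            counts[table] += 1    # join an existing table
        assignments.append(table)
    return assignments, counts

assignments, counts = sample_crp(n_customers=20, alpha=1.0)
print(len(counts), "tables for 20 customers:", counts)
```

Notice that the number of tables is never passed in; it falls out of the sampling process, which is exactly why this kind of prior is attractive when you don't know how many super-categories to expect.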

Finally, we can see some learned weights for three distinct object categories: truck, van, and bucket.  Please see the paper if you want to learn more about sharing -- the clarity of Ruslan's paper is exceptional.

In conclusion, it is pretty clear that everybody wants some sort of visual memex. (It is easy to think of the visual memex as a graph where the nodes are individual instances and the edges are relationships between these entities.)  Sharing, borrowing, multi-task regularization, exemplar-SVMs, and a host of other approaches are hinting at the breakdown of the traditional category-based way of approaching the problem of object recognition.  However, our machine learning tools were designed for supervised machine learning with explicit class information.  So what we, the researchers, do is try to break down those classical tools so that we can more effectively exploit the blurry line between not-so-different object categories.  At the end of the day, rigid categories can only get us so far.  Intelligence requires interpretation at multiple and potentially disparate levels.  When it comes to intelligence, the world is not black and white; there are many flavours of meaningful image interpretation.