Tuesday, December 13, 2011

learning to "borrow" examples for object detection. Lim et al., NIPS 2011

Let's say you want to train a cat detector...  If you're anything like me, then you probably have a few labeled cats (~100), as well as a source of non-cat images (~1000).  So what do you do when you can't get any more labeled cats?  (Maybe Amazon's Mechanical Turk service was shut down by the feds, you've got a paper deadline in 48 hours, and money can't get you out of this dilemma.)

Answer: 
1) Realize that there are some labeled dogs/cows/sheep in your dataset!
2) Transform some of the dogs/cows/sheep in your dataset to make them look more like cats. Maybe some dogs are already sufficiently similar to cats! (see cheezburger.com image below)
3) Use a subset of those transformed dogs/cows/sheep examples as additional positives in your cat detector!

Some dogs just look like cats! (and vice-versa)


Using my own internal language, I view this phenomenon as "exemplar theft."  But not the kind of theft which sends you to prison, 'tis the kind of theft which gives you best-paper prizes at your local conference.

Note that this was the answer provided by the vision hackers at MIT in their most recent paper, "Transfer Learning by Borrowing Examples for Multiclass Object Detection," which was just presented at NIPS 2011, this year's big machine-learning conference. See the illustration from the paper below, which depicts this type of example borrowing/sharing for some objects in the SUN09 dataset.


The paper empirically demonstrates that instead of doing transfer learning (also known as multi-task learning) the typical way (regularizing weight vectors towards each other), it is beneficial to simply borrow a subset of (transformed) examples from a related class.  Of course the problem is that we do not know a priori which categories to borrow from, nor which instances from those categories will give us a gain in object detection performance.  The goal of the algorithm is to learn which categories to borrow from, and which examples to borrow.  Not all dogs will help the cat detector.
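To make the borrowing idea concrete, here is a minimal toy sketch in the spirit of "not all dogs help the cat detector." It is not the paper's actual algorithm (which learns soft borrowing indicators inside a regularized objective); instead it greedily keeps a candidate example only if retraining with it does not hurt held-out performance. The 1-D "detector," all function names, and the numbers are hypothetical.

```python
# Toy sketch of example borrowing (NOT the paper's method): greedily keep
# borrowed examples from a related class only when they do not hurt
# validation accuracy. Examples are 1-D features; the "detector" is a
# threshold halfway between the class means.

def train_threshold(positives, negatives):
    """Toy 1-D 'detector': threshold halfway between class means."""
    pos_mean = sum(positives) / len(positives)
    neg_mean = sum(negatives) / len(negatives)
    return (pos_mean + neg_mean) / 2.0

def accuracy(thresh, pos, neg):
    """Fraction of validation points on the correct side of the threshold."""
    correct = sum(1 for x in pos if x > thresh) + sum(1 for x in neg if x <= thresh)
    return correct / (len(pos) + len(neg))

def borrow_examples(target_pos, negatives, candidates, val_pos, val_neg):
    """Greedily borrow candidates that keep validation accuracy at least as high."""
    borrowed = []
    best = accuracy(train_threshold(target_pos, negatives), val_pos, val_neg)
    for c in candidates:
        thresh = train_threshold(target_pos + borrowed + [c], negatives)
        acc = accuracy(thresh, val_pos, val_neg)
        if acc >= best:  # keep the borrowed example only if it doesn't hurt
            borrowed.append(c)
            best = acc
    return borrowed
```

With "cats" clustered around 5-6 and negatives near 0, a cat-like "dog" at 5.2 gets borrowed while a cow-like outlier at 2.1 is rejected because it drags the decision boundary toward the negatives.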

Here are some examples of popular object categories, the categories from which examples are borrowed, and the categories from which examples are shared once we allow transformations to happen.  Notice the improvement in AP (average precision; higher is better) when you allow sharing.



They also looked at what happens if you want to improve a single-category "badass detector" on one particular dataset, such as the PASCAL VOC.  Note that these days just about everybody is using the one-and-only badass detector and trying to beat it at its own game.   These are the different ways you'll hear people refer to the Latent-SVM-based Deformable Part Model baseline: "badass detector" = "state-of-the-art detector" = "Felzenszwalb et al. detector" = "Pedro's detector" = "Deva's detector" = "Pedro/Deva detector" = "LDPM detector" = "DPM detector".

Even if you only care about your favourite dataset, such as PASCAL VOC, you're probably willing to use additional positive data points from another dataset.  In their NIPS paper, the MIT hackers show that simply concatenating datasets is inferior to their clever example borrowing algorithm (mathematical details are found in the paper, but feel free to ask me detailed questions in the comments).  In the figure below, the top row shows cars from one dataset (SUN09), the middle row shows PASCAL VOC 2007 cars, and the bottom row shows which example the SUN09-car detector wants to borrow from PASCAL VOC.

Here is the cross-dataset generalization performance on the SUN09/PASCAL duo.  These results were inspired by the dataset bias work of Torralba and Efros.



In case you're interested, here is the full citation for this excellent NIPS 2011 paper:

Joseph J. Lim, Ruslan Salakhutdinov, and Antonio Torralba. "Transfer Learning by Borrowing Examples for Multiclass Object Detection," in NIPS 2011. [pdf]




To get a better understanding of Lim et al.'s paper, it is worthwhile going back in time to CVPR 2011 and taking a quick look at the following paper, also from MIT:

Ruslan Salakhutdinov, Antonio Torralba, Josh Tenenbaum. "Learning to Share Visual Appearance for Multiclass Object Detection," in CVPR 2011. [pdf]

Of course, these authors need no introduction (they are all professors at big-time institutions). Ruslan just recently became a professor and is now back on home turf in Toronto (where he got his PhD), where he is likely to become the next Hinton.  In my opinion, this "Learning to Share" paper was one of the best papers of CVPR 2011.  In this paper they introduced the idea of sharing across rigid classifier templates and, more importantly, learning a tree to organize hundreds of object categories.  The tree defines how the sharing is supposed to happen.  The root node is shared across all categories, the mid-level nodes can be interpreted as super-categories (e.g., animal, vehicle), and the leaves are the actual object categories (e.g., dog, chair, person, truck).
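A tiny sketch of what tree-structured sharing means operationally: a category's effective template is the sum of the weight vectors along its path from the root, so siblings under the same super-category share the root and super-category components. The tree, weights, and names below are illustrative toys, not the learned model from the paper.

```python
# Illustrative sketch of tree-structured template sharing: each category's
# template = root weights + super-category weights + leaf weights.
# Tree layout and integer weights are made up for clarity.

tree = {
    "dog":    ["root", "animal", "dog"],
    "person": ["root", "animal", "person"],
    "truck":  ["root", "vehicle", "truck"],
}

weights = {
    "root":    [1, 1],    # shared across ALL categories
    "animal":  [5, -2],   # shared by animal leaves
    "vehicle": [-3, 6],   # shared by vehicle leaves
    "dog":     [2, 0],
    "person":  [0, 1],
    "truck":   [1, 2],
}

def category_template(category):
    """Sum the shared weight vectors along the category's root-to-leaf path."""
    path = tree[category]
    dim = len(weights["root"])
    return [sum(weights[node][d] for node in path) for d in range(dim)]

def score(category, feature):
    """Linear detection score of a feature vector under a category's template."""
    return sum(w * f for w, f in zip(category_template(category), feature))
```

Because "dog" and "person" share the root and "animal" components, training data for one animal effectively regularizes the others, which is the point of the shared tree.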

The coolest thing about the paper is that they use a Chinese Restaurant Process (CRP) to learn the tree without having to specify the number of super-categories in advance!
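For intuition, here is a minimal sampler for the standard CRP (not the paper's full model): each new item joins an existing cluster with probability proportional to the cluster's size, or opens a new cluster with probability proportional to a concentration parameter alpha. The number of clusters ("super-categories") is never fixed up front.

```python
import random

# Minimal Chinese Restaurant Process sampler. Item i sits at existing
# table k with probability n_k / (i + alpha), or opens a new table with
# probability alpha / (i + alpha). Function name and defaults are mine.

def crp_partition(n_items, alpha, rng=None):
    rng = rng or random.Random(0)
    tables = []      # tables[k] = number of items seated at table k
    assignment = []  # assignment[i] = table index of item i
    for _ in range(n_items):
        # Unnormalized weights: existing tables by size, a new table by alpha.
        weights = tables + [alpha]
        r = rng.random() * sum(weights)
        for k, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        if k == len(tables):
            tables.append(1)  # open a new table
        else:
            tables[k] += 1
        assignment.append(k)
    return assignment, tables
```

Running it with a larger alpha tends to produce more clusters; in the paper's setting the "tables" play the role of super-categories grouping the object classes.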

Finally, we can see some learned weights for three distinct object categories: truck, van, and bucket.  Please see the paper if you want to learn more about sharing -- the clarity of Ruslan's paper is exceptional.




In conclusion, it is pretty clear that everybody wants some sort of visual memex. (It is easy to think of the visual memex as a graph where the nodes are individual instances and the edges are relationships between those entities.)  Sharing, borrowing, multi-task regularization, exemplar-SVMs, and a host of other approaches are hinting at the breakdown of the traditional category-based way of approaching object recognition.  However, our machine learning tools were designed for supervised learning with explicit class information.  So what we, the researchers, do is bend those classical tools so that we can more effectively exploit the blurry line between not-so-different object categories.  At the end of the day, rigid categories can only get us so far.  Intelligence requires interpretation at multiple and potentially disparate levels.  When it comes to intelligence, the world is not black and white; there are many flavours of meaningful image interpretation.

6 comments:

  2. Hi Tomasz, I want to ask you a basic question.

    If I want to train a cat detector, how can I get non-cat images that cover so many different non-cat labels? It is a confusing problem for me.

    For example, dog is a non-cat label, car is also a non-cat label; television, cup, table, cellphone... are all non-cat labels too. How do I collect images for all these labels? I think they are infinite. But if I only get some non-cat images, say 1000, to train the detector, how does the detector measure the difference between a cat and a label not covered by those 1000 images?

    There is another related question: why don't we use a method that only models cats to detect cats, like anomaly detection? Is this the difference between discriminative and generative models?

  3. @loveisp

    One will never be able to collect enough images to cover all of the world's visual concepts. In practice, dealing with thousands of images for negative data is sufficient. Remember that in the detection problem, the goal is to localize objects within images. This means that if I give you a single image and tell you that it doesn't contain any cats, any subwindow in that image can be treated as a non-cat. Most of those subwindows will be non-object patches. This means that a single negative image gives rise to ~20,000 negative data points.
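A quick back-of-the-envelope calculation shows where a number like ~20,000 subwindows per image comes from: count every (position, scale) placement of a sliding window. The stride and scale settings below are illustrative, not those of any specific detector.

```python
# Count sliding-window subwindows in one image across positions and scales.
# Every such window in a cat-free image is a valid negative example.
# Stride, scale step, and scale count are illustrative choices.

def count_windows(img_w, img_h, win_w, win_h, stride, scale_step=1.2, n_scales=10):
    total = 0
    w, h = float(win_w), float(win_h)
    for _ in range(n_scales):
        if w <= img_w and h <= img_h:
            nx = int((img_w - w) // stride) + 1  # horizontal placements
            ny = int((img_h - h) // stride) + 1  # vertical placements
            total += nx * ny
        w *= scale_step  # grow the window instead of shrinking the image
        h *= scale_step
    return total
```

For a 640x480 image with a 64x64 base window and a stride of 8 pixels, this already yields tens of thousands of windows, which is why mining negatives from whole images is so cheap.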

    Generative methods have a longer history than discriminative methods, and there are many popular generative methods around. However, it is still a matter of research to tell which method (or combination thereof) will prevail.

    Replies
    1. loveisp:

      Thank you for your reply. I am blocked by the GFW, so I can't read your blog often. What you say makes sense. I think I should run some experiments on this topic. Thanks again.

  4. I like your blog and all the papers you reviewed. Good job!

  5. Dear sir,
    I am working on my senior project about car detection using MATLAB (video processing).
    I am not sure if you can help me, but I am facing some problems with the code and am asking for your help, please.
    Best regards
