Tuesday, December 13, 2011

learning to "borrow" examples for object detection. Lim et al, NIPS 2011

Let's say you want to train a cat detector...  If you're anything like me, then you probably have a few labeled cats (~100), as well as a source of non-cat images (~1000).  So what do you do when you can't get any more labeled cats?  (Maybe Amazon's Mechanical Turk service was shut down by the feds, you've got a paper deadline in 48 hours, and money can't get you out of this dilemma.)

Answer: 
1) Realize that there are some labeled dogs/cows/sheep in your dataset!
2) Transform some of the dogs/cows/sheep in your dataset to make them look more like cats. Maybe some dogs are already sufficiently similar to cats! (see cheezburger.com image below)
3) Use a subset of those transformed dogs/cows/sheep examples as additional positives in your cat detector!

Some dogs just look like cats! (and vice-versa)


Using my own internal language, I view this phenomenon as "exemplar theft."  But not the kind of theft which sends you to prison, 'tis the kind of theft which gives you best-paper prizes at your local conference.

Note that this was the answer provided by the vision hackers at MIT in their most recent paper, "Transfer Learning by Borrowing Examples for Multiclass Object Detection," which was just presented at this year's big machine-learning conference, NIPS 2011. See the illustration from the paper below, which depicts this type of example borrowing/sharing for some objects in the SUN09 dataset.


The paper empirically demonstrates that instead of doing transfer learning (also known as multi-task learning) the typical way (regularizing weight vectors towards each other), it is beneficial to simply borrow a subset of (transformed) examples from a related class.  Of course, the problem is that we do not know a priori which categories to borrow from, nor which instances from those categories will give us a gain in object detection performance.  The goal of the algorithm is to learn both which categories to borrow from and which examples to borrow.  Not all dogs will help the cat detector.
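To give a flavor of the idea, here is a toy Python sketch of "borrowing." This is NOT the authors' actual optimization -- the paper learns borrowing indicators jointly with the detector inside a regularized objective and also applies transformations to the borrowed examples -- it just illustrates the spirit: score the candidate dog examples with an initial cat-only model and promote the most cat-like ones to positives. All variable names and the borrow fraction are made up.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_with_borrowing(X_cat, X_neg, X_dog, borrow_frac=0.25):
    """Toy 'borrowing': add the most cat-like dog examples as extra cat
    positives.  (The actual paper learns soft borrowing indicators jointly
    with the detector and also transforms the borrowed examples.)"""
    # 1) Initial detector trained on the true cat positives only.
    X0 = np.vstack([X_cat, X_neg])
    y0 = np.hstack([np.ones(len(X_cat), dtype=int),
                    np.zeros(len(X_neg), dtype=int)])
    clf0 = LinearSVC(C=1.0).fit(X0, y0)

    # 2) Score candidate dog examples; keep only the most cat-like ones.
    scores = clf0.decision_function(X_dog)
    n_borrow = int(borrow_frac * len(X_dog))
    borrowed = X_dog[np.argsort(-scores)[:n_borrow]]

    # 3) Retrain with the borrowed examples added as additional positives.
    X1 = np.vstack([X_cat, borrowed, X_neg])
    y1 = np.hstack([np.ones(len(X_cat) + len(borrowed), dtype=int),
                    np.zeros(len(X_neg), dtype=int)])
    return LinearSVC(C=1.0).fit(X1, y1)
```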

Here are some examples of popular object categories, the categories from which examples are borrowed, and the categories from which examples are borrowed once transformations are allowed.  Notice the improvement in AP (higher average precision is better) when you allow sharing.



They also looked at what happens if you want to improve a badass single-category detector on one particular dataset, such as the PASCAL VOC.  Note that these days just about everybody is using the one-and-only "badass detector" and trying to beat it at its own game.  Here are the different ways you'll hear people refer to the Latent-SVM-based Deformable Part Model baseline: "badass detector" = "state-of-the-art detector" = "Felzenszwalb et al. detector" = "Pedro's detector" = "Deva's detector" = "Pedro/Deva detector" = "LDPM detector" = "DPM detector".

Even if you only care about your favourite dataset, such as PASCAL VOC, you're probably willing to use additional positive data points from another dataset.  In their NIPS paper, the MIT hackers show that simply concatenating datasets is inferior to their clever example-borrowing algorithm (the mathematical details are in the paper, but feel free to ask me detailed questions in the comments).  In the figure below, the top row shows cars from one dataset (SUN09), the middle row shows PASCAL VOC 2007 cars, and the bottom row shows which examples the SUN09 car detector wants to borrow from PASCAL VOC.

Here is the cross-dataset generalization performance on the SUN09/PASCAL duo.  These results were inspired by the dataset-bias work of Torralba and Efros.



In case you're interested, here is the full citation for this excellent NIPS 2011 paper:

Joseph J. Lim, Ruslan Salakhutdinov, and Antonio Torralba. "Transfer Learning by Borrowing Examples for Multiclass Object Detection," in NIPS 2011. [pdf]




To get a better understanding of Lim et al.'s paper, it is worthwhile going back in time to CVPR 2011 and taking a quick look at the following paper, also from MIT:

Ruslan Salakhutdinov, Antonio Torralba, Josh Tenenbaum. "Learning to Share Visual Appearance for Multiclass Object Detection," in CVPR 2011. [pdf]

Of course, these authors need no introduction (they are all professors at big-time institutions). Ruslan just recently became a professor and is now back on home turf in Toronto (where he got his PhD), where he is likely to become the next Hinton.  In my opinion, this "Learning to Share" paper was one of the best papers of CVPR 2011.  In it they introduced the idea of sharing across rigid classifier templates and, more importantly, learning a tree to organize hundreds of object categories.  The tree defines how the sharing is supposed to happen.  The root node is global and shared across all categories, the mid-level nodes can be interpreted as super-categories (e.g., animal, vehicle), and the leaves are the actual object categories (e.g., dog, chair, person, truck).
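Here is a toy sketch of the path-sum flavor of this kind of sharing. The node names, dimensionality, and the exact way the weights combine are purely illustrative, not the paper's actual model -- the point is just that a leaf category's detector is assembled from weight vectors shared along its root-to-leaf path.

```python
import numpy as np

# Toy 3-level sharing tree: root -> super-categories -> leaf categories.
# (Node names and dimensionality are made up for illustration.)
tree = {
    "dog":   ["root", "animal",  "dog"],
    "sheep": ["root", "animal",  "sheep"],
    "truck": ["root", "vehicle", "truck"],
}

D = 1984  # pretend HOG-template dimensionality
weights = {node: np.zeros(D) for path in tree.values() for node in path}

def template(category):
    """Compose a category's detector by summing the shared weight vectors
    along its root-to-leaf path: related categories literally share parameters."""
    return sum(weights[node] for node in tree[category])

# "dog" and "sheep" share the root and "animal" weights; "truck" shares only the root.
w_dog = template("dog")
```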

The coolest thing about the paper is that they use a CRP (Chinese restaurant process) to learn the tree without having to specify the number of super-categories!
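For the curious, here is what the CRP prior itself looks like as a sampling procedure. The paper does full posterior inference over the tree jointly with the classifier weights; this toy snippet only samples a flat partition of categories into super-categories.

```python
import numpy as np

def crp_partition(n_items, alpha=1.0, seed=0):
    """Sample a partition of items (object categories) into 'tables'
    (super-categories) from a Chinese restaurant process: item i joins an
    existing table with probability proportional to its size, or starts a
    new table with probability proportional to alpha."""
    rng = np.random.RandomState(seed)
    tables = []  # tables[k] = list of item indices seated at table k
    for i in range(n_items):
        probs = np.array([len(t) for t in tables] + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(tables):
            tables.append([i])      # open a new super-category
        else:
            tables[k].append(i)     # join an existing one
    return tables

# e.g. group 100 categories into an a-priori-unknown number of super-categories
print(crp_partition(100, alpha=2.0))
```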

Finally, we can see some learned weights for three distinct object categories: truck, van, and bucket.  Please see the paper if you want to learn more about sharing -- the clarity of Ruslan's paper is exceptional.




In conclusion, it is pretty clear everybody wants some sort of visual memex. (It is easy to think of the visual memex as a graph where the nodes are individual instances and the edges are relationships between those entities.)  Sharing, borrowing, multi-task regularization, exemplar-SVMs, and a host of other approaches are hinting at the breakdown of the traditional category-based way of approaching object recognition.  However, our machine learning tools were designed for supervised learning with explicit class information.  So what we, the researchers, do is try to break down those classical tools so that we can more effectively exploit the blurry line between not-so-different object categories.  At the end of the day, rigid categories can only get us so far.  Intelligence requires interpretation at multiple, potentially disparate levels.  When it comes to intelligence, the world is not black and white; there are many flavours of meaningful image interpretation.

Tuesday, December 06, 2011

Graphics meets Big Data meets Machine Learning

We've all played Where's Waldo as children, and at least for me it was quite a fun game.  So today let's play an image-based Big Data version of Where's Waldo.  I will give you a picture, and you have to find it in a large collection of images!  This is a form of image retrieval, and this particular formulation is also commonly called "image matching."


The only catch is that you are given just one picture, and I am free to replace the picture with a painting or a sketch.  Any two-dimensional pattern is a valid query image, but the key thing to note is that there is only a single input image. Life would be awesome if Google's Picasa had this feature built in!


The classical way of solving this problem is via a brute-force nearest-neighbor algorithm -- one which doesn't match pixel patterns directly, but instead compares images using a state-of-the-art image descriptor such as GIST.  Back at SIGGRAPH 2007, James Hays and Alexei Efros showed this to work quite well once you have a very large database of images!  But the reason the database had to be so large is that a naive nearest-neighbor algorithm is actually quite dumb.  The descriptor might be cleverer than matching raw pixel intensities, but to a machine an image is nothing but a matrix of numbers, and nobody told the machine which patterns in the matrix are meaningful and which ones aren't.  In short, the brute-force algorithm works if the database contains images similar enough that all parts of the input image match a retrieved image.  But ideally we would like the algorithm to get better matches by automatically figuring out which parts of the query image are meaningful (e.g., the fountain in the painting) and which parts aren't (e.g., the reflections in the water).
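For concreteness, here is what that brute-force baseline boils down to (a minimal sketch; computing the descriptors, e.g. GIST, is assumed to happen elsewhere). Notice that every dimension of the descriptor counts equally -- the machine has no idea which ones actually matter:

```python
import numpy as np

def retrieve(query_desc, database_descs, k=5):
    """Brute-force nearest-neighbor retrieval: rank every database image by
    Euclidean distance to the query in descriptor space (e.g. GIST vectors).
    query_desc: (D,) array, database_descs: (N, D) array."""
    dists = np.linalg.norm(database_descs - query_desc, axis=1)
    return np.argsort(dists)[:k]  # indices of the k most similar images
```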

A modern approach to this issue is to collect a large set of related "positive images" and a large set of unrelated "negative images," and then train a powerful classifier which can hopefully figure out the meaningful bits of the image. But here the problem is twofold.  First, with only a single input image, it is not clear whether standard machine learning tools have a chance of learning anything meaningful.  The second, significantly worse, issue is that without a category label or tag, how are we supposed to create a negative set?!?  Exemplar-SVMs to the rescue!  We can use a large collection of images from the target domain (the domain we want to find matches in) as the negative set -- as long as that "negative set" contains only a small fraction of potentially related images, learning a linear SVM with a single positive still works.
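Here is a minimal scikit-learn sketch of that single-positive learning step. The variable names and class weights are my own illustrative choices; the real system works on HOG features and adds hard-negative mining and calibration (see the paper for the actual details).

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_exemplar_svm(x_query, X_negatives, C_pos=0.5, C_neg=0.01):
    """Learn a linear SVM with a single positive (the query descriptor)
    against a large pool of descriptors from the target domain.  The heavy
    asymmetry in class weights keeps the lone positive from being ignored."""
    X = np.vstack([x_query[None, :], X_negatives])
    y = np.hstack([[1], np.zeros(len(X_negatives), dtype=int)])
    clf = LinearSVC(C=1.0, class_weight={1: C_pos, 0: C_neg})
    return clf.fit(X, y)

# Rank candidate images by clf.decision_function(their_descriptors):
# the learned weights emphasize the dimensions that make the query unique.
```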




Here is an excerpt from a TechCrunch article which summarizes the project concisely:

"Instead of comparing a given image head to head with other images and trying to determine a degree of similarity, they turned the problem around. They compared the target image with a great number of random images and recorded the ways in which it differed the most from them. If another image differs in similar ways, chances are it’s similar to the first image. " -- Techcrunch


Abhinav Shrivastava, Tomasz Malisiewicz, Abhinav Gupta, Alexei A. Efros. Data-driven Visual Similarity for Cross-domain Image Matching. In SIGGRAPH ASIA, December 2011. Project Page



Here is a short listing of some articles which mention our research (thanks, Abhinav!).




Monday, December 05, 2011

An accidental face detector

Disclaimer #1: I don't specialize in faces.  When it comes to learning, I like my objectives to be convex.  When it comes to hacking on vision systems, I like to tackle entry-level object categories.

Fun fact #1: Faces are probably the easiest objects in the world for a machine to localize/detect/recognize.

Note #1: I supplied the images, my algorithm supplied the red boxes.

Note #2: Sorry to all my friends who failed to get detected by my accidental face detector! (see below)

So I was hackplaying with some of my PhD thesis code over Thanksgiving, and I accidentally made a face detector.  Oops!  I immediately fired up my screenshot capture tool and ran my code on my Mac desktop while browsing Google Images and Facebook.  It seems to work pretty well on real faces as well as on sketches/paintings of faces (see below)!  I even caught two Berkeleyites (an Alyosha and a Jianbo), but you gotta find them for yourself.  The detector is definitely tuned to frontal faces, but it runs pretty fast and produces few false positives.  Not too shabby for some midnight hackerdom.










Yes, I'm doing dense multiscale sliding windows here.  Yes, I'm HoGGing the hell outta these images. Yes, I'm using a single frontal-face-tuned template.  And yes, I only used faces of myself to train this accidental face detector.
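For the curious, that recipe boils down to something like the sketch below. This is not my actual thesis code -- the real thing convolves the template with a HOG feature pyramid, which is orders of magnitude faster than cropping windows and recomputing HOG, and the window size, stride, and scales here are made up.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import rescale

def detect_faces(image_rgb, w, b, win=(128, 128), step=16,
                 scales=(1.0, 0.75, 0.5), thresh=0.0):
    """Single-template, dense multiscale sliding-window detection.
    (w, b) is a linear classifier over the HOG features of a win-sized crop
    (w must match that HOG dimensionality); returns (score, x, y, scale)
    tuples, in original-image coordinates, for windows above threshold."""
    gray = rgb2gray(image_rgb)
    detections = []
    for s in scales:
        im = rescale(gray, s)
        H, W = im.shape
        for y in range(0, H - win[0] + 1, step):
            for x in range(0, W - win[1] + 1, step):
                crop = im[y:y + win[0], x:x + win[1]]
                feat = hog(crop, orientations=9, pixels_per_cell=(8, 8),
                           cells_per_block=(2, 2))
                score = float(np.dot(w, feat) + b)
                if score > thresh:
                    detections.append((score, x / s, y / s, s))
    return sorted(detections, reverse=True)
```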

Note: If I've used one of your pictures without permission, and you would like a link back to your home on the interwebs, please leave a comment indicating the image and link to original.



Friday, December 02, 2011

Google Scholar, My Citations, a new paradigm for finding great Computer Vision research papers

I have been finding great computer vision research papers by using Google Scholar for the past 2+ years.  My recipe is straightforward and has two key ingredients. First, by finding new papers that cite one of my published papers, I automatically get to read papers which will be relevant to my own research interests.  The best bit is that by using Google Scholar, I'm not limiting my search to a single conference -- Google finds papers from the raw web.

Second, I have a short list of superstar vision researchers (Jitendra Malik, among others), and I basically read anything and everything these gurus publish.  Regularly visiting academic homepages is the best way to do this, but Google Scholar also lets me search by name.  In addition, nobody lists their papers' citation counts on their homepage.  This means that when I visit a researcher's personal website, I have to decide what to read based only on (title, co-authors, publication venue).  But highly cited papers are likely the most important ones to read first.  I believe this is a good rule of thumb, and a very important one if you are new to the field.

I am really glad that Google finally let researchers make public profiles where they can list their papers, see their citations, etc.  See the Google Scholar blog for more information.  I've been using StatCounter to monitor my blog's visitors, and now I can use Google Scholar to monitor who is citing my research papers!  I'm not claiming that the only way to get me to read one of your papers is to cite one of mine, but believe me, even if we never met at a vision conference, if you cited one of my papers there's a good chance I already know about your research :-)  I would love to see Google Scholar Citations pages one day replace the "my publications" sections on academic homepages...



My Citations screenshot



My only complaint with Google Scholar is that I can't seem to get it to recognize my two most recent papers.  I have these papers listed on my homepage, and so do my co-authors, but Google isn't picking them up!!!  I manually added them to my Google My Citations page, and using Google Scholar I was able to find at least one other paper which cites one of these two papers.

I read the inclusion guidelines, and I'm still baffled.  The PDFs are definitely over 5MB, but my older papers, which Google did index, were also over 5MB.  Dear Google, are you seriously not indexing my recent papers because they are over 5MB?  It takes us researchers months of hard work to get our work out the door.  We see the sun rise for weeks straight when we are in deadline mode, and the conferences/journals give us size limitations -- we work hard to make our stuff fit within these limits (something like 20MB per PDF).  And we researchers are crazy about Google and what it means for organizing the world's information -- naturally we are jumping on the Google Scholar bandwagon. I really hope there's some silly reason why I can't find my own papers using Google Scholar, but if I can't find my own work, that means others can't find my work either, and until I can be confident that Google Scholar is bug-free, I cannot give it my full recommendation.

Problematic papers for Google Scholar:

Abhinav Shrivastava, Tomasz Malisiewicz, Abhinav Gupta, Alexei A. Efros. Data-driven Visual Similarity for Cross-domain Image Matching. In SIGGRAPH ASIA, December 2011.

Tomasz Malisiewicz, Abhinav Gupta, Alexei A. Efros. Ensemble of Exemplar-SVMs for Object Detection and Beyond. In ICCV, November 2011.

If anybody has any suggestions (there's a chance I'm doing something wrong), or an explanation as to why my papers haven't been indexed, I would love to hear from you.