Tuesday, December 06, 2011

Graphics meets Big Data meets Machine Learning

We've all played Where's Waldo as children, and at least for me it was quite a fun game.  So today let's play an image-based Big Data version of Where's Waldo.  I will give you a picture, and you have to find it in a large collection of images!  This is a form of image retrieval, and this particular formulation is also commonly called "image matching."

The only catch is that you are only given one picture, and I am free to replace the picture with a painting or a sketch.  Any two-dimensional pattern is a valid query image, but the key thing to note is that there is only a single input image. Life would be awesome if Google's Picasa had this feature built in!

The classical way of solving this problem is via a brute-force nearest neighbor algorithm, an algorithm which won't match pixel pattern directly, but an algorithm which will also use a state-of-the-art image descriptor such as GIST for comparison.  Back in 2007, at SIGGRAPH, James Hays and Alexei Efros have shown this to work quite well once you have a very large database of images!  But the reason why the database had to be so large is because a naive Nearest Neighbor algorithm is actually quite dumb.  The descriptor might be cleverer than matching raw pixel intensities, but for a machine, an image is nothing but a matrix of numbers, and nobody told the machine which patterns in the matrix are meaningful and which ones aren't.  In short, the brute-force algorithm works if there are similar enough images such that all parts of the input image will match a retrieved image.  But ideally we would like the algorithm to get better matches by automatically figuring out which parts of the query image are meaningful  (e.g., the fountain in the painting) and which parts aren't (e.g., the reflections in the water).

A modern approach to solve this issue is to collect a large set of related "positive images" and a large set of un-related "negative images" and then train a powerful classifier which can hopefully figure out the meaningful bits of the image. But in this approach the problem is twofold.  First, working with a single input image it is not clear whether standard machine learning tools will have a chance of learning anything meaningful.  The second issue, a significantly worse problem, is that without a category label or tag, how are we supposed to create a negative set?!?  Exemplar-SVMs to the rescue!  We can use a large collection of images from the target domain (the domain we want to find matches from) as the negative set -- as long as the "negative set" contains only a small fraction of potentially related images, learning a linear SVM with a single positive still works.

Here is an excerpt from a Techcrunch article which summarizes the project concisely:

"Instead of comparing a given image head to head with other images and trying to determine a degree of similarity, they turned the problem around. They compared the target image with a great number of random images and recorded the ways in which it differed the most from them. If another image differs in similar ways, chances are it’s similar to the first image. " -- Techcrunch

Abhinav ShrivastavaTomasz MalisiewiczAbhinav GuptaAlexei A. EfrosData-driven Visual Similarity for Cross-domain Image Matching. In SIGGRAPH ASIA, December 2011. Project Page

Here is a short listing of some articles which mention our research (thank Abhinav!).