The only catch is that you are given just one picture, and I am free to replace that picture with a painting or a sketch. Any two-dimensional pattern is a valid query image, but the key thing to note is that there is only a single input image. Life would be awesome if Google's Picasa had this feature built in!
The classical way of solving this problem is via a brute-force nearest-neighbor algorithm -- one which doesn't match raw pixel patterns directly, but instead compares images using a state-of-the-art descriptor such as GIST. Back in 2007, at SIGGRAPH, James Hays and Alexei Efros showed this to work quite well once you have a very large database of images! But the reason the database had to be so large is that a naive nearest-neighbor algorithm is actually quite dumb. The descriptor might be cleverer than matching raw pixel intensities, but to a machine an image is nothing but a matrix of numbers, and nobody told the machine which patterns in the matrix are meaningful and which ones aren't. In short, the brute-force algorithm works if the database contains images similar enough that all parts of the input image will match a retrieved image. But ideally we would like the algorithm to get better matches by automatically figuring out which parts of the query image are meaningful (e.g., the fountain in the painting) and which parts aren't (e.g., the reflections in the water).
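To make the brute-force baseline concrete, here is a minimal sketch of descriptor-based nearest-neighbor retrieval. The `tiny_descriptor` helper is a hypothetical, crude stand-in for a real GIST implementation (which this snippet does not include), and images are assumed to be NumPy arrays of a reasonable size; it illustrates the idea, not the actual pipeline from the Hays & Efros work.

```python
import numpy as np

def tiny_descriptor(image, grid=16):
    # Hypothetical stand-in for a holistic descriptor such as GIST:
    # average-pool a grayscale version of the image onto a grid x grid map
    # and L2-normalize the result.
    gray = image.mean(axis=-1) if image.ndim == 3 else image.astype(float)
    h, w = gray.shape
    pooled = np.array([[gray[i * h // grid:(i + 1) * h // grid,
                             j * w // grid:(j + 1) * w // grid].mean()
                        for j in range(grid)] for i in range(grid)])
    vec = pooled.ravel()
    return vec / (np.linalg.norm(vec) + 1e-8)

def nearest_neighbors(query_image, database_images, k=5):
    # Brute-force retrieval: compare the query descriptor against every
    # database image and return the indices of the k closest ones.
    q = tiny_descriptor(query_image)
    descriptors = np.stack([tiny_descriptor(im) for im in database_images])
    dists = np.linalg.norm(descriptors - q, axis=1)
    return np.argsort(dists)[:k]
```

The point of the sketch is simply that every dimension of the descriptor counts equally here -- the matcher has no idea which parts of the query matter.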
A modern approach to this issue is to collect a large set of related "positive" images and a large set of unrelated "negative" images, and then train a powerful classifier which can hopefully figure out the meaningful bits of the image. But this approach has two problems. First, with only a single positive image, it is not clear whether standard machine learning tools have any chance of learning something meaningful. The second issue, a significantly worse one, is that without a category label or tag, how are we supposed to create a negative set?!? Exemplar-SVMs to the rescue! We can use a large collection of images from the target domain (the domain we want to find matches in) as the negative set -- as long as the "negative set" contains only a small fraction of potentially related images, learning a linear SVM with a single positive still works.
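Here is a rough sketch, using scikit-learn, of that single-positive idea: train a linear SVM whose only positive example is the query's feature vector and whose negatives are features from a large pool of target-domain images. The class weights and constants below are illustrative guesses, and the published Exemplar-SVM method additionally uses hard-negative mining and careful calibration, which this toy version omits.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_exemplar_svm(positive_feature, negative_features,
                       pos_weight=50.0, neg_weight=1.0):
    # One positive exemplar vs. many negatives. The lone positive is weighted
    # far more heavily than each negative so the learner cannot ignore it.
    # (pos_weight / neg_weight are illustrative, not the paper's constants.)
    X = np.vstack([positive_feature[None, :], negative_features])
    y = np.concatenate([[1], -np.ones(len(negative_features), dtype=int)])
    clf = LinearSVC(C=1.0, class_weight={1: pos_weight, -1: neg_weight})
    clf.fit(X, y)
    return clf

def rank_matches(clf, candidate_features):
    # Higher decision values correspond to candidates that "differ in similar
    # ways" to the exemplar; sort them best-first.
    scores = clf.decision_function(candidate_features)
    return np.argsort(-scores)
```

The learned weight vector is what ends up emphasizing the meaningful parts of the query (the fountain) and down-weighting the incidental ones (the reflections in the water).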
Here is an excerpt from a TechCrunch article which summarizes the project concisely:
"Instead of comparing a given image head to head with other images and trying to determine a degree of similarity, they turned the problem around. They compared the target image with a great number of random images and recorded the ways in which it differed the most from them. If another image differs in similar ways, chances are it’s similar to the first image. " -- Techcrunch
Abhinav Shrivastava, Tomasz Malisiewicz, Abhinav Gupta, Alexei A. Efros. Data-driven Visual Similarity for Cross-domain Image Matching. In SIGGRAPH ASIA, December 2011. Project Page
Here is a short listing of some articles which mention our research (thanks, Abhinav!).
I read the TechCrunch description a few days ago and I remember specifically noting that paragraph because it so succinctly summarizes the method. Especially the part "...If another image differs in similar ways..." shows genuine understanding I would not expect from a random tech writer/photographer. Did someone provide this description, or did they actually read the paper and figure this out all by themselves?
Btw also congrats on all the press! #i_thought_this_was_cool_before_everyone_else
@Andrej: Even I was surprised at how well it was summarized. To my knowledge, no one provided the description and the person actually read the paper and figured it out himself!
As always, thanks for the great post! I thought the video gave an especially good high-level explanation. Great work!
Abhinav put the video together for us; he did a great job.
@Andrej and @Abhinav, we should hire that guy to write abstracts for us!