Friday, December 06, 2013

Brand Spankin' New Vision Papers from ICCV 2013

The International Conference on Computer Vision, ICCV, gathers the world's best researchers in Computer Vision and Machine Learning to showcase their newest and hottest ideas. (My work on the Exemplar-SVM debuted two years ago at ICCV 2011 in Barcelona.) This year, at ICCV 2013 in Sydney, Australia, the vision community witnessed lots of grand new ideas and excellent presentations, and gained new insights which are likely to influence the direction of vision research in the upcoming decade.

3D data is everywhere.  Detectors are not only getting faster, but getting stylish.  Edges are making a comeback.  HOGgles let you see the world through the eyes of an algorithm. Computers can automatically make your face pictures more memorable. And why ever stop learning, when you can learn all day long?

Here is a breakdown of some of the must-read ICCV 2013 papers which I'd like to share with you:

From Large Scale Image Categorization to Entry-Level Categories. Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L. Berg, ICCV 2013.

This paper is the Marr Prize winning paper from this year's conference.  It is all about entry-level categories - the labels people will use to name an object - which were originally defined and studied by psychologists in the 1980s. In the ICCV paper, the authors study entry-level categories at a large scale and learn the first models for predicting entry-level categories for images. The authors learn mappings between concepts predicted by existing visual recognition systems and entry-level concepts that could be useful for improving human-focused applications such as natural language image description or retrieval. NOTE: If you haven't read Eleanor Rosch's seminal 1978 paper, The Principles of Categorization, do yourself a favor: grab a tall coffee, read it and prepare to be rocked.

Structured Forests for Fast Edge Detection, P. Dollar and C. L. Zitnick, ICCV 2013.

This paper from Microsoft Research is all about pushing the boundaries for edge detection. Randomized Decision Trees and Forests have been used in lots of excellent Microsoft research papers, with Jamie Shotton's Kinect work being one of the best examples, and they are now being used for super high-speed edge detection.  However, this paper is not just about edges.  Quoting the authors, "We describe a general purpose method for learning structured random decision forest that robustly uses structured labels to select splits in the trees."  Anybody serious about learning for low-level vision should take a look.

There is also some code available, but take a very detailed look at the license before you use it in your project.  It is not an MIT license.

HOGgles: Visualizing Object Detection Features, C. Vondrick, A. Khosla, T. Malisiewicz, A. Torralba. ICCV 2013.

"The real voyage of discovery consists not in seeking new landscapes but in having new eyes." — Marcel Proust

This is our MIT paper, which I already blogged about (Can you pass the HOGgles test?), so instead of rehashing what was already mentioned, I'll just leave you with the quote above.  There are lots of great visualizations that Carl Vondrick put together on the HOGgles project webpage, so take a look.

Style-aware Mid-level Representation for Discovering Visual Connections in Space and Time. Yong Jae Lee, Alexei A. Efros, and Martial Hebert, ICCV 2013.

“Learn how to see. Realize that everything connects to everything else.” – Leonardo da Vinci

This paper is all about discovering how visual entities change as a function of time and space.  One great example is how the appearance of cars has changed over the past several decades.  Another example is how typical Google Street View images change as a function of going North-to-South in the United States.  Surely the North looks different than the South -- we now have an algorithm that can automatically discover these precise differences.

By the way, congratulations on the move to Berkeley, Monsieur Efros.  I hope that not only will your insatiable thirst for cultured life be satisfied in the city which fostered your intellectual growth, but that you will also continue to inspire, educate, and motivate the next generation of visionaries.

NEIL: Extracting Visual Knowledge from Web Data. Xinlei Chen, Abhinav Shrivastava and Abhinav Gupta. In ICCV 2013.

Fucking awesome! I don't normally use profanity in my blog, but I couldn't come up with a better phrase to describe the ideas presented in this paper.  A computer program which runs 24/7 to collect visual data from the internet and continually learn what the world is all about.  This is machine learning, this is AI, this is the future.  None of this train on my favourite dataset, test on my favourite dataset bullshit.  If there's anybody that's going to do it the right way, it's the CMU gang.  This paper gets my unofficial "Vision Award." Congratulations, Xinlei!

This sort of never-ending learning has been applied to text by Tom Mitchell's group (also from CMU), but this is the first serious attempt at never-ending visual learning.  The underlying algorithm is a semi-supervised learning algorithm which uses Google Image search to bootstrap the initial detectors, but eventually learns object-object relationships, object-attribute relationships, and scene-attribute relationships.

Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition. J. F. Henriques, J. Carreira, R. Caseiro, J. Batista. ICCV 2013.

Want faster detectors? Tired of hard-negative mining? Love all things Fourier?  Then this paper is for you.  Aren't you now glad you fell in love with linear algebra at a young age? This paper very clearly shows that there is a better way to perform hard-negative mining when the negatives are mined from translations of an underlying image pattern, as is typically done in object detection.  The basic idea is simple, and that's why this paper wins the "thumbs-up from tombone" award. The crux of the derivation in the paper is the observation that the Gram matrix of a set of images and their translated versions, as modeled by cyclic shifts, exhibits a block-circulant structure.  Instead of incrementally mining negatives, in this paper they show that it is possible to learn directly from a training set comprising all image subwindows of a predetermined aspect-ratio and show this is feasible for a rich set of popular models including Ridge Regression, Support Vector Regression (SVR) and Logistic Regression.  Move over hard-negative mining, Joseph Fourier just rocked your world.
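To make the Fourier trick concrete, here is a toy sketch of my own, not the authors' code. It assumes the simplest possible setting: a single 1D base pattern `x`, a training set made of all of its cyclic shifts, and plain Ridge Regression with a regularizer `lam`. The function names and the random data are made up for illustration. In this setting the data matrix is circulant, so the usual linear solve collapses into element-wise operations on FFTs. The paper handles the much richer case of many images and block-circulant Gram matrices, which this sketch does not attempt.

```python
import numpy as np


def ridge_all_shifts_direct(x, y, lam):
    """Ridge regression on every cyclic shift of x, solved the slow way.

    Row i of the data matrix is x shifted by i positions, so the matrix
    is circulant. Cost is O(n^3) because of the dense linear solve.
    """
    n = x.size
    X = np.stack([np.roll(x, i) for i in range(n)])
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)


def ridge_all_shifts_fft(x, y, lam):
    """Same problem, solved in the Fourier domain in O(n log n).

    A circulant matrix is diagonalized by the DFT, so the normal
    equations decouple into one scalar equation per frequency.
    """
    x_hat = np.fft.fft(x)
    y_hat = np.fft.fft(y)
    w_hat = x_hat * y_hat / (np.abs(x_hat) ** 2 + lam)
    # The inputs are real, so the imaginary part is only round-off.
    return np.fft.ifft(w_hat).real


# Hypothetical toy data: a random base pattern and random regression targets.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
y = rng.standard_normal(64)

w_direct = ridge_all_shifts_direct(x, y, lam=0.1)
w_fft = ridge_all_shifts_fft(x, y, lam=0.1)
print("largest difference:", np.max(np.abs(w_direct - w_fft)))
```

The two solutions should agree up to floating-point error, while the FFT version never builds the matrix of shifted copies at all. That is the flavor of the speedup: every translated negative is accounted for implicitly instead of being mined one round at a time.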

P.S. Joao Carreira also created the CPMC image segmentation algorithm at CVPR 2010.  A recent blog post from Piotr Dollár (December 10th, 2013), "A Seismic Shift in Object Detection," discusses how segmentation is coming back into vision in a big way.

3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding, Scott Satkin and Martial Hebert. ICCV 2013.

A new way of matching images that come equipped with 3D data.  Whether the data comes from Google Sketchup, or is the output of a Kinect-like scanner, more and more visual data comes with its own 3D interpretation.  Unfortunately, most state-of-the-art image matching methods rely on comparing purely visual cues.  This paper is based on an idea called "fine-grained geometry refinement" and allows the transfer of information across extreme viewpoint changes.  While still computationally expensive, it allows non-parametric (i.e., data-driven) approaches to get away with using significantly smaller amounts of data.

Modifying the Memorability of Face Photographs.  Aditya Khosla, Wilma A. Bainbridge, Antonio Torralba and Aude Oliva, ICCV 2013.

Ever wanted to look more memorable in your photos?  Maybe your ad-campaign could benefit from better face pictures which are more likely to stick in people's minds.  Well, now there's an algorithm for that.  Another great MIT paper, in which the authors show that the memorability of photographs can not only be measured, but also automatically enhanced!

SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels. J. Xiao, A. Owens and A. Torralba. ICCV 2013.

Xiao et al. continue their hard-core data collection efforts.  Now in 3D.  In addition to collecting a vast dataset of 3D reconstructed scenes, they show that there are some kinds of errors that simply cannot be overcome with high-quality solvers.  Some problems are too big and too ambitious (e.g., walking around an entire house with a Kinect) for even the best industrial-grade solvers (Google's Ceres solver) to tackle.  In this paper, they show that a small amount of human annotation is all it takes to snap those reconstructions in place.  And not any sort of crazy, click-here, click-there interfaces.  Simple LabelMe-like annotation interfaces, which require annotating object polygons, can be used to create additional object-object constraints which help the solvers do their magic.  For anybody interested in long-range scene reconstruction, take a look at their paper.

If there's one person I've ever seen that collects data while the rest of the world sleeps, it is definitely Prof. Xiao.  Congratulations on the new faculty position!  Princeton has been starving for a person like you.  If anybody is looking for PhD/Masters/postdoc positions, and wants to work alongside one of the most ambitious and driven upcoming researchers in vision (Prof. Xiao), take a look at his disclaimer/call for students/postdocs at Princeton, then apply to the program directly.  Did I mention that you probably have to be a hacker/scientist badass to land a position in his lab?

Other noteworthy papers:

Mining Multiple Queries for Image Retrieval: On-the-fly learning of an Object-specific Mid-level Representation. B. Fernando, T. Tuytelaars,  ICCV 2013.

Training Deformable Part Models with Decorrelated Features. R. Girshick, J. Malik, ICCV 2013.

Sorry if I missed your paper; there were just too many good ones to list.  For those of you still in Sydney, be sure to either take a picture of a kangaroo, or eat one.