The first generation of datasets in the computer vision community were just plain old images -- simple arrays of pixels. That may seem like nothing fancy, but we must recall that there was a time when a single image could barely fit inside a computer's memory. During those early days, researchers showcased their image processing algorithms on the famous Lenna image. Later we saw datasets like the Corel dataset, Caltech-101, LabelMe, SUN, James Hays' 6 million Flickr images, PASCAL VOC, and ImageNet. These were impressive collections of images, and for the first time computer vision researchers did not have to collect their own images. As Jitendra Malik once said, large annotated datasets marked the end of the "Wild Wild West" in computer vision -- for the first time, large datasets allowed researchers to compare object recognition algorithms on the same sets of images! What differs across these datasets is the annotation: some come with labels at the image level, some with annotated polygons, and some with nothing more than objects annotated at the bounding-box level. In each case, images are captured by a camera and annotations are produced by a human annotation effort. But these traditional vision datasets lack depth, 3D information, or anything of that sort. LabelMe3D was an attempt at reconstructing depth from object annotations, but it would only work in a pop-up-world kind of way.
The next generation of datasets is all about going into 3D. But not just annotated depth images like the NYU2 Depth Dataset, depicted in the following image:
What a 3D environment dataset (or 3D place dataset) is all about is making 3D reconstructions the basic primitive of research. This means that an actual 3D reconstruction algorithm first has to be run to create the dataset. This is a fairly new idea in the computer vision community. The paper introducing such a dataset, SUN3D, was presented at this year's ICCV 2013 conference. I briefly outlined the paper in my ICCV 2013 summary blog post, but I felt this topic was worthy of its own blog post. For those interested, the paper link is below:
J. Xiao, A. Owens, and A. Torralba. SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels. Proceedings of the 14th IEEE International Conference on Computer Vision (ICCV 2013). paper link
Running a 3D reconstruction algorithm is no easy feat, so Xiao et al. found that some basic polygon-level annotations were sufficient for snapping Structure from Motion algorithms into place. For those of you who don't know what a Structure from Motion (SfM) algorithm is, it is a process which reconstructs the 3D locations of points inside images (the structure) as well as the camera parameters (the motion) for a sequence of images. Xiao et al.'s SfM algorithm uses the depth data from a Kinect sensor in addition to the manually provided object annotations. Check out their video below:
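To make the "structure" part concrete, the core primitive inside any SfM pipeline is triangulation: once the camera parameters are known, corresponding 2D points in two views pin down a 3D point. Here is a minimal numpy sketch of the standard linear (DLT) method -- the camera intrinsics and the 3D point are made up for illustration, and this is my own toy example, not the SUN3D code:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of a single point seen in two views.
    P1, P2 are 3x4 projection matrices; x1, x2 are the 2D observations."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # null vector of A is the homogeneous 3D point
    X = Vt[-1]
    return X[:3] / X[3]

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Synthetic sanity check with made-up intrinsics and a known 3D point.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                  # camera at the origin
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])  # shifted half a meter
X_true = np.array([0.5, -0.2, 4.0])
X_est = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
print(np.allclose(X_est, X_true))  # True
```

A full SfM system wraps this primitive in feature matching, camera-pose estimation, and bundle adjustment -- and in SUN3D's case, the object annotations supply extra constraints for that last optimization.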
Depth dataset vs SUN 3D dataset
The NYU2 Depth dataset is useful for studying object detection algorithms which operate on 2.5D images, while MIT's SUN 3D dataset is useful for contextual reasoning and object-object relationships. This is important because Kinect images do not give full 3D; they merely return a 2.5D "depth image" from which certain physical relationships cannot be easily inferred.
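To see what a 2.5D "depth image" actually gives you, here is a short numpy sketch that back-projects a depth map into a 3D point cloud using the pinhole camera model. The intrinsics below are assumed, Kinect-like values, and the depth map is a toy example:

```python
import numpy as np

# Assumed pinhole intrinsics (fx, fy, cx, cy), roughly Kinect-like.
fx = fy = 525.0
cx, cy = 319.5, 239.5

depth = np.full((480, 640), 2.0)   # toy depth map: a flat wall 2 meters away

v, u = np.indices(depth.shape)     # pixel coordinates of every depth sample
z = depth
x = (u - cx) * z / fx              # back-project each pixel along its viewing ray
y = (v - cy) * z / fy
points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
print(points.shape)  # (307200, 3)
```

Note what is missing: you get one 3D point per pixel, all on visible surfaces -- nothing behind occlusions, nothing outside the frustum. That is exactly why full 3D reconstructions, stitched from many such views, enable reasoning that a single depth image cannot.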
Startups doing 3D
It is also worth pointing out that Matterport, a new startup, is building its own sensors and algorithms for helping people create their own 3D reconstructions. Check out their video below:
What this means for the rest of us
We should expect the next generation of smartphones to have their own 3D sensors. In addition, we should expect the new generation of wearable devices such as Google Glass to give us more than 2D reasoning; they should be able to use this 3D data to make better visual inferences. I'm glad to see 3D getting more and more popular, as it allows researchers to work on new problems, new data structures, and push their creativity to the next level!
Deep Learning, Computer Vision, and the algorithms that are shaping the future of Artificial Intelligence.
Friday, December 27, 2013
Understanding the Visual World, one 3D reconstruction at a time
Labels: 3D reconstruction, annotation, computer vision, depth, ICCV 2013, jianxiong xiao, kinect, labelme, labelme3D, MIT, paper, reconstruction, sfm, structure from motion
Friday, December 06, 2013
Brand Spankin' New Vision Papers from ICCV 2013
The International Conference on Computer Vision, ICCV, gathers the world's best researchers in Computer Vision and Machine Learning to showcase their newest and hottest ideas. (My work on the Exemplar-SVM debuted two years ago at ICCV 2011 in Barcelona.) This year, at ICCV 2013 in Sydney, Australia, the vision community witnessed lots of grand new ideas, excellent presentations, and gained new insights which are likely to influence the direction of vision research in the upcoming decade.
3D data is everywhere. Detectors are not only getting faster, but getting stylish. Edges are making a comeback. HOGgles let you see the world through the eyes of an algorithm. Computers can automatically make your face pictures more memorable. And why ever stop learning, when you can learn all day long?
Here is a breakdown of some of the must-read ICCV 2013 papers which I'd like to share with you:
From Large Scale Image Categorization to Entry-Level Categories, Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L. Berg, ICCV 2013.
This paper is the Marr Prize winning paper from this year's conference. It is all about entry-level categories - the labels people will use to name an object - which were originally defined and studied by psychologists in the 1980s. In the ICCV paper, the authors study entry-level categories at a large scale and learn the first models for predicting entry-level categories for images. The authors learn mappings between concepts predicted by existing visual recognition systems and entry-level concepts that could be useful for improving human-focused applications such as natural language image description or retrieval. NOTE: If you haven't read Eleanor Rosch's seminal 1978 paper, The Principles of Categorization, do yourself a favor: grab a tall coffee, read it and prepare to be rocked.
Structured Forests for Fast Edge Detection, P. Dollar and C. L. Zitnick, ICCV 2013.
This paper from Microsoft Research is all about pushing the boundaries for edge detection. Randomized Decision Trees and Forests have been used in lots of excellent Microsoft research papers, with Jamie Shotton's Kinect work being one of the best examples, and they are now being used for super high-speed edge detection. However, this paper is not just about edges. Quoting the authors, "We describe a general purpose method for learning structured random decision forest that robustly uses structured labels to select splits in the trees." Anybody serious about learning for low-level vision should take a look.
There is also some code available, but take a very detailed look at the license before you use it in your project. It is not an MIT license.
HOGgles: Visualizing Object Detection Features, C. Vondrick, A. Khosla, T. Malisiewicz, A. Torralba. ICCV 2013.
"The real voyage of discovery consists not in seeking new landscapes but in having new eyes." — Marcel Proust
This is our MIT paper, which I already blogged about (Can you pass the HOGgles test?), so instead of rehashing what was already mentioned, I'll just leave you with the quote above. There are lots of great visualizations that Carl Vondrick put together on the HOGgles project webpage, so take a look.
Style-aware Mid-level Representation for Discovering Visual Connections in Space and Time, Yong Jae Lee, Alexei A. Efros, and Martial Hebert, ICCV 2013.
"Learn how to see. Realize that everything connects to everything else." – Leonardo da Vinci
This paper is all about discovering how visual entities change as a function of time and space. One great example is how the appearance of cars has changed over the past several decades. Another example is how typical Google Street View images change as a function of going North-to-South in the United States. Surely the North looks different than the South -- we now have an algorithm that can automatically discover these precise differences.
By the way, congratulations on the move to Berkeley, Monsieur Efros. I hope your insatiable thirst for cultured life will not only be satisfied in the city which fostered your intellectual growth, but you will continue to inspire, educate, and motivate the next generation of visionaries.
NEIL: Extracting Visual Knowledge from Web Data. Xinlei Chen, Abhinav Shrivastava and Abhinav Gupta. In ICCV 2013. www.neil-kb.com
Fucking awesome! I don't normally use profanity in my blog, but I couldn't come up with a better phrase to describe the ideas presented in this paper. A computer program which runs 24/7 to collect visual data from the internet and continually learns what the world is all about. This is machine learning, this is AI, this is the future. None of this train-on-my-favourite-dataset, test-on-my-favourite-dataset bullshit. If there's anybody that's going to do it the right way, it's the CMU gang. This paper gets my unofficial "Vision Award." Congratulations, Xinlei!
This sort of never-ending learning has been applied to text by Tom Mitchell's group (also from CMU), but this is the first serious attempt at never-ending visual learning. The underlying algorithm is a semi-supervised learning algorithm which uses Google Image search to bootstrap the initial detectors, but eventually learns object-object relationships, object-attribute relationships, and scene-attribute relationships.
Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition. J. F. Henriques, J. Carreira, R. Caseiro, J. Batista. ICCV 2013.
Want faster detectors? Tired of hard-negative mining? Love all things Fourier? Then this paper is for you. Aren't you now glad you fell in love with linear algebra at a young age? This paper very clearly shows that there is a better way to perform hard-negative mining when the negatives are mined from translations of an underlying image pattern, as is typically done in object detection. The basic idea is simple, and that's why this paper wins the "thumbs-up from tombone" award. The crux of the derivation in the paper is the observation that the Gram matrix of a set of images and their translated versions, as modeled by cyclic shifts, exhibits a block-circulant structure. Instead of incrementally mining negatives, they show that it is possible to learn directly from a training set comprising all image subwindows of a predetermined aspect ratio, and that this is feasible for a rich set of popular models, including Ridge Regression, Support Vector Regression (SVR), and Logistic Regression. Move over hard-negative mining, Joseph Fourier just rocked your world.
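To see why circulant structure buys you anything, consider the 1-D toy case: ridge regression where the training set is all cyclic shifts of a single signal. Circulant matrices diagonalize under the DFT, so the exact n×n solution collapses to a few FFTs and an element-wise divide. This little numpy sketch (my own illustration, not the authors' code) checks the Fourier-domain solution against the explicit one:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 64, 0.1
x = rng.standard_normal(n)   # base sample (think: an image row)
y = rng.standard_normal(n)   # regression targets, one per shift

# Explicit route: training matrix = all n cyclic shifts of x (circulant).
X = np.stack([np.roll(x, i) for i in range(n)])
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Fourier route: the circulant system diagonalizes under the DFT,
# so the same ridge solution is an element-wise operation on spectra.
xf, yf = np.fft.fft(x), np.fft.fft(y)
w_fft = np.real(np.fft.ifft(xf * yf / (np.abs(xf) ** 2 + lam)))

print(np.allclose(w_direct, w_fft))  # True
```

The explicit route costs O(n^3); the Fourier route costs O(n log n). That gap is exactly what lets the paper train on *every* subwindow instead of mining negatives a handful at a time.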
P.S. Joao Carreira also created the CPMC image segmentation algorithm at CVPR 2010. A recent blog post from Piotr Dollár (December 10th, 2013), "A Seismic Shift in Object Detection" discusses how segmentation is coming back into vision in a big way.
3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding, Scott Satkin and Martial Hebert. ICCV 2013.
A new way of matching images that come equipped with 3D data. Whether the data comes from Google Sketchup, or is the output of a Kinect-like scanner, more and more visual data comes with its own 3D interpretation. Unfortunately, most state-of-the-art image matching methods rely on comparing purely visual cues. This paper is based on an idea called "fine-grained geometry refinement" and allows the transfer of information across extreme viewpoint changes. While still computationally expensive, it allows non-parametric (i.e., data-driven) approaches to get away with using significantly smaller amounts of data.
Modifying the Memorability of Face Photographs. Aditya Khosla, Wilma A. Bainbridge, Antonio Torralba and Aude Oliva, ICCV 2013.
Ever wanted to look more memorable in your photos? Maybe your ad-campaign could benefit from better face pictures which are more likely to stick in people's minds. Well, now there's an algorithm for that. Another great MIT paper, in which the authors show that the memorability of photographs can not only be measured, but automatically enhanced!
SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels. J. Xiao, A. Owens and A. Torralba. ICCV 2013. sun3d.cs.princeton.edu
Xiao et al. continue their hard-core data collection efforts. Now in 3D. In addition to collecting a vast dataset of 3D reconstructed scenes, they show that there are some kinds of errors that simply cannot be overcome with high-quality solvers alone. Some problems are too big and too ambitious (e.g., walking around an entire house with a Kinect) for even the best industrial-grade solvers (such as Google's Ceres solver) to tackle. In this paper, they show that a small amount of human annotation is all it takes to snap those reconstructions into place. And not via any sort of crazy click-here, click-there interfaces. Simple LabelMe-like annotation interfaces, which require annotating object polygons, can be used to create additional object-object constraints which help the solvers do their magic. For anybody interested in long-range scene reconstruction, take a look at their paper.
If there's one person I've ever seen that collects data while the rest of the world sleeps, it is definitely Prof. Xiao. Congratulations on the new faculty position! Princeton has been starving for a person like you. If anybody is looking for PhD/Masters/postdoc positions, and wants to work alongside one of the most ambitious and driven upcoming researchers in vision (Prof. Xiao), take a look at his disclaimer/call for students/postdocs at Princeton, then apply to the program directly. Did I mention that you probably have to be a hacker/scientist badass to land a position in his lab?
Other noteworthy papers:
Mining Multiple Queries for Image Retrieval: On-the-fly learning of an Object-specific Mid-level Representation. B. Fernando, T. Tuytelaars, ICCV 2013.
Training Deformable Part Models with Decorrelated Features. R. Girshick, J. Malik, ICCV 2013.
Sorry if I missed your paper, there were just too many good ones to list. For those of you still in Sydney, be sure to either take a picture of a Kangaroo, or eat one.
Labels: 3D vision, berkeley, categorization, CMU, computer vision, efros, fourier, google, gupta, hebert, iccv, ICCV 2013, Marr prize, Microsoft, MIT, ordonez, papers, research, rosch, satkin