
Friday, June 17, 2016

Making Deep Networks Probabilistic via Test-time Dropout

In Quantum Mechanics, Heisenberg's Uncertainty Principle states that there is a fundamental limit to how precisely one can simultaneously know a particle's position and momentum. In the context of machine learning systems, a similar principle has emerged, but relating interpretability and performance. With a manually wired or shallow machine learning model, you'll have no problem understanding the moving pieces, but you will seldom be happy with the results. Or you can use a black-box deep neural network and enjoy the model's exceptional performance. Today we'll see one simple and effective trick to make our deep black boxes a bit more intelligible. The trick allows us to convert neural network outputs into probabilities, with no cost to performance and minimal computational overhead.

Interpretability vs Performance: Deep Neural Networks perform well on most computer vision tasks, yet they are notoriously difficult to interpret.

The desire to understand deep neural networks has triggered a flurry of research into Neural Network Visualization, but in practice we are often forced to treat deep learning systems as black-boxes. (See my recent Deep Learning Trends @ ICLR 2016 post for an overview of recent neural network visualization techniques.) But just because we can't grok the inner-workings of our favorite deep models, it doesn't mean we can't ask more out of our deep learning systems.

There exists a simple trick for upgrading black-box neural network outputs into probability distributions. 

The probabilistic approach provides confidences, or "uncertainty" measures, alongside predictions and can make almost any deep learning system smarter. For robotic applications, or any kind of software that must make decisions based on the output of a deep learning system, being able to provide meaningful uncertainties is a true game-changer.


Applying Dropout to your Deep Neural Network is like occasionally zapping your brain
The key ingredient is dropout, an anti-overfitting deep learning trick handed down from Hinton himself (see Krizhevsky's pioneering 2012 paper). Dropout randomly zeroes out a fraction of the units (activations) during training, reducing feature co-adaptation and thus improving generalization.
Without dropout, it is too easy to make a moderately deep network attain 100% accuracy on the training set. 
The accepted knowledge is that an un-regularized network (one without dropout) is too good at memorizing the training set. For a great introductory machine learning video lecture on dropout, I highly recommend you watch Hugo Larochelle's lecture on Dropout for Deep learning.
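The mechanics are simple enough to fit in a few lines. Below is a minimal numpy sketch of "inverted" dropout applied to a layer's activations; the keep probability, layer sizes, and seed are illustrative placeholders, not a recipe from any particular paper.

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_keep=0.5, train=True):
    # Inverted dropout: during training, keep each unit with probability
    # p_keep and rescale survivors by 1/p_keep so the expected activation
    # matches test time. At test time this is just the identity.
    if not train:
        return h
    mask = (rng.random(h.shape) < p_keep) / p_keep
    return h * mask

# Toy usage: a batch of 4 examples, a hidden layer of 8 units.
h = rng.standard_normal((4, 8))
print(dropout_forward(h, p_keep=0.5))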


Geoff Hinton's dropout lecture, also a great introduction, focuses on interpreting dropout as an ensemble method. If you're looking for new research ideas in the dropout space, a thorough understanding of Hinton's interpretation is a must.


But while dropout is typically used at training-time, today we'll highlight the keen observation that dropout used at test-time is one of the simplest ways to turn raw neural network outputs into probability distributions. Not only does this probabilistic "free upgrade" often improve classification results, it provides a meaningful notion of uncertainty, something typically missing in Deep Learning systems.
The idea is quite simple: to estimate the predictive mean and predictive uncertainty, simply collect the results of stochastic forward passes through the model using dropout. 
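In symbols, following Gal and Ghahramani's derivation (roughly; the precise form of the noise term depends on the prior and dropout rate), T stochastic passes at a test input give

\[ \mathbb{E}[y^*] \approx \frac{1}{T} \sum_{t=1}^{T} \hat{y}^*_t(x^*), \qquad
   \operatorname{Var}[y^*] \approx \tau^{-1} I \;+\; \frac{1}{T} \sum_{t=1}^{T} \hat{y}^*_t(x^*)^{\top} \hat{y}^*_t(x^*) \;-\; \mathbb{E}[y^*]^{\top} \mathbb{E}[y^*] \]

where each \( \hat{y}^*_t \) is the output of one stochastic forward pass and \( \tau^{-1} I \) accounts for observation noise. In other words: the sample mean of the passes is the prediction, and their sample variance (plus a noise term) is the uncertainty.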

How to use dropout: 2016 edition

  1. Start with a moderately sized network
  2. Increase your network size with dropout turned off until you perfectly fit your data
  3. Then, train with dropout turned on
  4. At test-time, turn on dropout and run the network T times to get T samples
  5. The mean of the samples is your output and the variance is your measure of uncertainty

Remember that drawing more samples will increase computation time during testing unless you're clever about re-using partial computations in the network. Please note that if you're only using dropout near the end of your network, you can reuse most of the computations. If you're not happy with the uncertainty estimates, consider adding more layers of dropout at test-time. Since you'll already have a pre-trained network, experimenting with test-time dropout layers is easy.
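To make steps 4 and 5 concrete, here is a minimal numpy sketch of Monte Carlo dropout at test time on a toy two-layer regression network. The architecture, weights, keep probability, and T below are placeholders for illustration, not any particular pre-trained model.

import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained two-layer network (random weights for illustration).
W1, b1 = rng.standard_normal((16, 8)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 1)), np.zeros(1)

def forward(x, p_keep=0.5):
    h = np.maximum(x @ W1 + b1, 0.0)                 # ReLU hidden layer
    h *= (rng.random(h.shape) < p_keep) / p_keep     # dropout stays ON at test time
    return h @ W2 + b2

def mc_dropout_predict(x, T=50):
    # Collect T stochastic forward passes; the sample mean is the prediction
    # and the sample variance is the uncertainty estimate.
    samples = np.stack([forward(x) for _ in range(T)])
    return samples.mean(axis=0), samples.var(axis=0)

x = rng.standard_normal((4, 16))                     # a batch of 4 test inputs
mean, var = mc_dropout_predict(x)
print(mean.ravel(), var.ravel())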

Bayesian Convolutional Neural Networks

To be truly Bayesian about a deep network's parameters, we wouldn't learn a single set of parameters w; we would infer a distribution over weights given the data, p(w|X,Y). Training is already quite expensive, requiring large datasets and expensive GPUs.
Bayesian learning algorithms can in theory provide much better parameter estimates for ConvNets and I'm sure some of our friends at Google are working on this already. 
But today we aren't going to talk about such full Bayesian Deep Learning systems, only systems that "upgrade" the model prediction y to p(y|x,w). In other words, only the network outputs gain a probabilistic interpretation.
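To make the contrast explicit, the full Bayesian route would compute the weight posterior and marginalize it out at prediction time (standard notation, not specific to any one architecture):

\[ p(\mathbf{w} \mid X, Y) \propto p(Y \mid X, \mathbf{w})\, p(\mathbf{w}), \qquad
   p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, \mathbf{w})\, p(\mathbf{w} \mid X, Y)\, d\mathbf{w} \]

That integral over weights is exactly what makes full Bayesian deep learning expensive, and it is what test-time dropout crudely approximates by sampling.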

An excellent deep learning computer vision system which uses test-time dropout comes from the University of Cambridge: a technique called SegNet. The SegNet approach introduced an Encoder-Decoder framework for dense semantic segmentation. More recently, a Bayesian extension of SegNet uses dropout at test-time to provide uncertainty estimates. Because the system provides a dense per-pixel labeling, the confidences can be visualized as per-pixel heatmaps. Is the segmentation system not performing well? Just look at the confidence heatmaps!

Bayesian SegNet. A fully convolutional neural network architecture which provides 
per-pixel class uncertainty estimates using dropout.


The Bayesian SegNet authors tested different strategies for dropout placement and determined that a handful of dropout layers near the encoder-decoder bottleneck is better than simply using dropout near the output layer. Interestingly, Bayesian SegNet improves the accuracy over vanilla SegNet. Their confidence maps show high uncertainty near object boundaries, but different test-time dropout schemes could provide a more diverse set of uncertainty estimates.

Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding Alex Kendall, Vijay Badrinarayanan, Roberto Cipolla, in arXiv:1511.02680, November 2015. [project page with videos]


Confidences are quite useful for evaluation purposes, because instead of providing a single average result across all pixels in all images, we can sort the pixels and/or images by the overall confidence in the prediction. When evaluating the top 10% most confident pixels, we should expect significantly higher performance. For example, the Bayesian SegNet approach achieves 75.4% global accuracy on the SUN RGBD dataset, and an astonishing 97.6% on the most confident 10% of the test set [personal communication with Bayesian SegNet authors]. This kind of sort-by-confidence evaluation was popularized by the PASCAL VOC Object Detection Challenge, where precision/recall curves were the norm. Unfortunately, as the research community moved towards large-scale classification, the notion of confidence was pushed aside. Until now.
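As a toy illustration of sort-by-confidence evaluation (the data below is random and the metric is plain per-pixel accuracy, not the official SUN RGBD protocol):

import numpy as np

def accuracy_at_confidence(pred, conf, labels, fraction=0.10):
    # Accuracy on the `fraction` most confident predictions.
    order = np.argsort(-conf)                 # most confident first
    top = order[:max(1, int(fraction * len(order)))]
    return np.mean(pred[top] == labels[top])

rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=1000)        # 1000 "pixels", 5 classes
pred = rng.integers(0, 5, size=1000)
conf = rng.random(1000)
print(accuracy_at_confidence(pred, conf, labels))   # top 10% only
print(np.mean(pred == labels))                      # all pixels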

Theoretical Bayesian Deep Learning

Deep networks that model uncertainty are truly meaningful machine learning systems. It turns out that we don't really have to understand how a deep network's neurons process image features to trust the system to make decisions. As long as the model provides uncertainty estimates, we'll know when the model is struggling. This is particularly important when your network is given inputs that are far from the training data.

The Gaussian Process: A machine learning approach with built-in uncertainty modeling

In a recent ICML 2016 paper, Yarin Gal and Zoubin Ghahramani develop a new theoretical framework casting dropout training in deep neural networks as approximate Bayesian inference in deep Gaussian processes. Gal's paper gives a complete theoretical treatment of the link between Gaussian processes and dropout, and develops the tools necessary to represent uncertainty in deep learning. They show that a neural network with arbitrary depth and non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation to the probabilistic deep Gaussian process. I have yet to see researchers use dropout between every layer, so the discrepancy between theory and practice suggests that more research is necessary.

Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning Yarin Gal, Zoubin Ghahramani, in ICML. June 2016. [Appendix with relationship to Gaussian Processes]
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks Yarin Gal, in arXiv:1512.05287. May 2016.
What My Deep Model Doesn't Know. Yarin Gal. Blog Post. July 2015 

Test-time dropout is used to provide uncertainty estimates for deep learning systems.

In conclusion, maybe we can never get both interpretability and performance when it comes to deep learning systems. But we can all agree that providing confidences, or uncertainty estimates, alongside predictions is always a good idea. Dropout, the very same regularization trick used to battle overfitting in deep models, shows up yet again. Sometimes all you need is to add some random variation to your input and average the results over many trials. Dropout lets you wiggle not only the network inputs but the entire architecture.

I do wonder what Yann LeCun thinks about Bayesian ConvNets... Last I heard, he was allergic to sampling.

Related Posts 
Deep Learning vs Probabilistic Graphical Models vs Logic April 2015
Deep Learning Trends @ ICLR 2016 June 2016

Tuesday, December 08, 2015

ICCV 2015: Twenty one hottest research papers

"Geometry vs Recognition" becomes ConvNet-for-X

Computer Vision used to be cleanly separated into two schools: geometry and recognition. Geometric methods like structure from motion and optical flow focus on measuring objective real-world quantities, such as 3D distances, directly from images, while recognition techniques like support vector machines and probabilistic graphical models focus on perceiving high-level semantic information (e.g., is this a dog or a table) directly from images.

The world of computer vision has changed, and it has changed fast. We now have powerful convolutional neural networks that are able to extract just about anything directly from images. So if your input is an image (or set of images), then there's probably a ConvNet for your problem.  While you do need a large labeled dataset, believe me when I say that collecting a large dataset is much easier than manually tweaking knobs inside your 100K-line codebase. As we're about to see, the separation between geometric methods and learning-based methods is no longer easily discernible.

By 2016 just about everybody in the computer vision community will have tasted the power of ConvNets, so let's take a look at some of the hottest new research directions in computer vision.

ICCV 2015's Twenty One Hottest Research Papers



This December in Santiago, Chile, the International Conference on Computer Vision (ICCV) 2015 is going to bring together the world's leading researchers in Computer Vision, Machine Learning, and Computer Graphics.

To no surprise, this year's ICCV is filled with lots of ConvNets, but this time these Deep Learning tools are being applied to much more creative tasks. Let's take a look at the following twenty one ICCV 2015 research papers, which will hopefully give you a taste of where the field is going.


1. Ask Your Neurons: A Neural-Based Approach to Answering Questions About Images Mateusz Malinowski, Marcus Rohrbach, Mario Fritz


"We propose a novel approach based on recurrent neural networks for the challenging task of answering of questions about images. It combines a CNN with a LSTM into an end-to-end architecture that predict answers conditioning on a question and an image."




2. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler



"To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book."







3. Learning to See by Moving Pulkit Agrawal, Joao Carreira, Jitendra Malik


"We show that using the same number of training images, features learnt using egomotion as supervision compare favourably to features learnt using class-label as supervision on the tasks of scene recognition, object recognition, visual odometry and keypoint matching."







4. Local Convolutional Features With Unsupervised Training for Image Retrieval Mattis Paulin, Matthijs Douze, Zaid Harchaoui, Julien Mairal, Florent Perronin, Cordelia Schmid



"We introduce a deep convolutional architecture that yields patch-level descriptors, as an alternative to the popular SIFT descriptor for image retrieval."






5. Deep Networks for Image Super-Resolution With Sparse Prior Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, Thomas Huang



"We show that a sparse coding model particularly designed for super-resolution can be incarnated as a neural network, and trained in a cascaded structure from end to end."



6. High-for-Low and Low-for-High: Efficient Boundary Detection From Deep Object Features and its Applications to High-Level Vision Gedas Bertasius, Jianbo Shi, Lorenzo Torresani



"In this work we show how to predict boundaries by exploiting object level features from a pretrained object-classification network."


7. A Deep Visual Correspondence Embedding Model for Stereo Matching Costs Zhuoyuan Chen, Xun Sun, Liang Wang, Yinan Yu, Chang Huang



"A novel deep visual correspondence embedding model is trained via Convolutional Neural Network on a large set of stereo images with ground truth disparities. This deep embedding model leverages appearance data to learn visual similarity relationships between corresponding image patches, and explicitly maps intensity values into an embedding feature space to measure pixel dissimilarities."





8. Im2Calories: Towards an Automated Mobile Vision Food Diary Austin Meyers, Nick Johnston, Vivek Rathod, Anoop Korattikara, Alex Gorban, Nathan Silberman, Sergio Guadarrama, George Papandreou, Jonathan Huang, Kevin P. Murphy



"We present a system which can recognize the contents of your meal from a single image, and then predict its nutritional contents, such as calories."









9. Unsupervised Visual Representation Learning by Context Prediction Carl Doersch, Abhinav Gupta, Alexei A. Efros



"How can one write an objective function to encourage a representation to capture, for example, objects, if none of the objects are labeled?"


10. Deep Neural Decision Forests Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, Samuel Rota Bulò



"We introduce a stochastic and differentiable decision tree model, which steers the representation learning usually conducted in the initial layers of a (deep) convolutional network."






11. Conditional Random Fields as Recurrent Neural Networks Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, Philip H. S. Torr



"We formulate mean-field approximate inference for the Conditional Random Fields with Gaussian pairwise potentials as Recurrent Neural Networks."






12. Flowing ConvNets for Human Pose Estimation in Videos Tomas Pfister, James Charles, Andrew Zisserman



"We investigate a ConvNet architecture that is able to benefit from temporal context by combining information across the multiple frames using optical flow."





13. Dense Optical Flow Prediction From a Static Image Jacob Walker, Abhinav Gupta, Martial Hebert



"Given a static image, P-CNN predicts the future motion of each and every pixel in the image in terms of optical flow. Our P-CNN model leverages the data in tens of thousands of realistic videos to train our model. Our method relies on absolutely no human labeling and is able to predict motion based on the context of the scene."


14. DeepBox: Learning Objectness With Convolutional Networks Weicheng Kuo, Bharath Hariharan, Jitendra Malik



"Our framework, which we call DeepBox, uses convolutional neural networks (CNNs) to rerank proposals from a bottom-up method."








15. Active Object Localization With Deep Reinforcement Learning Juan C. Caicedo, Svetlana Lazebnik



"This agent learns to deform a bounding box using simple transformation actions, with the goal of determining the most specific location of target objects following top-down reasoning."





16. Predicting Depth, Surface Normals and Semantic Labels With a Common Multi-Scale Convolutional Architecture David Eigen, Rob Fergus



"We address three different computer vision tasks using a single multiscale convolutional network architecture: depth prediction, surface normal estimation, and semantic labeling."


17. HD-CNN: Hierarchical Deep Convolutional Neural Networks for Large Scale Visual Recognition Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, Yizhou Yu



"We introduce hierarchical deep CNNs (HD-CNNs) by embedding deep CNNs into a category hierarchy. An HD-CNN separates easy classes using a coarse category classifier while distinguishing difficult classes using fine category classifiers."





18. FlowNet: Learning Optical Flow With Convolutional Networks Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox



"We construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task."







19. Understanding Deep Features With Computer-Generated Imagery Mathieu Aubry, Bryan C. Russell


"Rendered images are presented to a trained CNN and responses for different layers are studied with respect to the input scene factors."







20. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization Alex Kendall, Matthew Grimes, Roberto Cipolla



"Our system trains a convolutional neural network to regress the 6-DOF camera pose from a single RGB image in an end-to-end manner with no need of additional engineering or graph optimisation."





21. Visual Tracking With Fully Convolutional Networks Lijun Wang, Wanli Ouyang, Xiaogang Wang, Huchuan Lu




"A new approach for general object tracking with fully convolutional neural network."



Conclusion

While some can argue that the great convergence upon ConvNets is making the field less diverse, it is actually making the techniques easier to comprehend. It is easier to "borrow breakthrough thinking" from one research direction when the core computations are cast in the language of ConvNets. Using ConvNets, a properly trained (and motivated!) 21-year-old graduate student can actually compete on benchmarks where previously it would take an entire 6-year PhD cycle to compete on a non-trivial benchmark.

See you next week in Chile!


Update (January 13th, 2016)

The following awards were given at ICCV 2015.

Achievement awards

  • PAMI Distinguished Researcher Award (1): Yann LeCun
  • PAMI Distinguished Researcher Award (2): David Lowe
  • PAMI Everingham Prize Winner (1): Andrea Vedaldi for VLFeat
  • PAMI Everingham Prize Winner (2): Daniel Scharstein and Rick Szeliski for the Middlebury Datasets

Paper awards

  • PAMI Helmholtz Prize (1): David Martin, Charles Fowlkes, Doron Tal, and Jitendra Malik for their ICCV 2001 paper "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics".
  • PAMI Helmholtz Prize (2): Serge Belongie, Jitendra Malik, and Jan Puzicha, for their ICCV 2001 paper "Matching Shapes".
  • Marr Prize: Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulò, for "Deep Neural Decision Forests".
  • Marr Prize honorable mention: Saining Xie and Zhuowen Tu for "Holistically-Nested Edge Detection".
For more information about awards, see Sebastian Nowozin's ICCV-day-2 blog post.

I also wrote another ICCV-related blog post (January 13, 2016) about the Future of Real-Time SLAM.

Friday, December 06, 2013

Brand Spankin' New Vision Papers from ICCV 2013

The International Conference on Computer Vision (ICCV) gathers the world's best researchers in Computer Vision and Machine Learning to showcase their newest and hottest ideas. (My work on the Exemplar-SVM debuted two years ago at ICCV 2011 in Barcelona.) This year, at ICCV 2013 in Sydney, Australia, the vision community witnessed lots of grand new ideas and excellent presentations, and gained new insights which are likely to influence the direction of vision research in the upcoming decade.


3D data is everywhere.  Detectors are not only getting faster, but getting stylish.  Edges are making a comeback.  HOGgles let you see the world through the eyes of an algorithm. Computers can automatically make your face pictures more memorable. And why ever stop learning, when you can learn all day long?

Here is a breakdown of some of the must-read ICCV 2013 papers which I'd like to share with you:


From Large Scale Image Categorization to Entry-Level Categories. Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L. Berg, ICCV 2013.

This paper is the Marr Prize winning paper from this year's conference.  It is all about entry-level categories - the labels people will use to name an object - which were originally defined and studied by psychologists in the 1980s. In the ICCV paper, the authors study entry-level categories at a large scale and learn the first models for predicting entry-level categories for images. The authors learn mappings between concepts predicted by existing visual recognition systems and entry-level concepts that could be useful for improving human-focused applications such as natural language image description or retrieval. NOTE: If you haven't read Eleanor Rosch's seminal 1978 paper, The Principles of Categorization, do yourself a favor: grab a tall coffee, read it and prepare to be rocked.


Structured Forests for Fast Edge Detection, P. Dollar and C. L. Zitnick, ICCV 2013.

This paper from Microsoft Research is all about pushing the boundaries for edge detection. Randomized Decision Trees and Forests have been used in lots of excellent Microsoft research papers, with Jamie Shotton's Kinect work being one of the best examples, and it is now being used for super high-speed edge detection.  However this paper is not just about edges.  Quoting the authors, "We describe a general purpose method for learning structured random decision forest that robustly uses structured labels to select splits in the trees."  Anybody serious about learning for low-level vision should take a look.

There is also some code available, but take a very detailed look at the license before you use it in your project.  It is not an MIT license.


HOGgles: Visualizing Object Detection Features, C. Vondrick, A. Khosla, T. Malisiewicz, A. Torralba. ICCV 2013.


"The real voyage of discovery consists not in seeking new landscapes but in having new eyes." — Marcel Proust

This is our MIT paper, which I already blogged about (Can you pass the HOGgles test?), so instead of rehashing what was already mentioned, I'll just leave you with the quote above.  There are lots of great visualizations that Carl Vondrick put together on the HOGgles project webpage, so take a look.


Style-aware Mid-level Representation for Discovering Visual Connections in Space and Time. Yong Jae Lee, Alexei A. Efros, and Martial Hebert, ICCV 2013.


“Learn how to see. Realize that everything connects to everything else.” – Leonardo da Vinci

This paper is all about discovering how visual entities change as a function of time and space.  One great example is how the appearance of cars has changed over the past several decades.  Another example is how typical Google Street View images change as a function of going North-to-South in the United States.  Surely the North looks different than the South -- we now have an algorithm that can automatically discover these precise differences.

By the way, congratulations on the move to Berkeley, Monsieur Efros.  I hope your insatiable thirst for cultured life will not only be satisfied in the city which fostered your intellectual growth, but you will continue to inspire, educate, and motivate the next generation of visionaries.




NEIL: Extracting Visual Knowledge from Web Data. Xinlei Chen, Abhinav Shrivastava and Abhinav Gupta. In ICCV 2013. www.neil-kb.com




Fucking awesome! I don't normally use profanity in my blog, but I couldn't come up with a better phrase to describe the ideas presented in this paper.  A computer program which runs 24/7 to collect visual data from the internet and continually learn what the world is all about.  This is machine learning, this is AI, this is the future.  None of this train on my favourite dataset, test on my favourite dataset bullshit.  If there's anybody that's going to do it the right way, it's the CMU gang.  This paper gets my unofficial "Vision Award." Congratulations, Xinlei!



This sort of never-ending learning has been applied to text by Tom Mitchell's group (also from CMU), but this is the first serious attempt at never-ending visual learning.  The underlying algorithm is a semi-supervised learning algorithm which uses Google Image search to bootstrap the initial detectors, but eventually learns object-object relationships, object-attribute relationships, and scene-attribute relationships.



Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition. J. F. Henriques, J. Carreira, R. Caseiro, J. Batista. ICCV 2013.

Want faster detectors? Tired of hard-negative mining? Love all things Fourier?  Then this paper is for you.  Aren't you now glad you fell in love with linear algebra at a young age? This paper very clearly shows that there is a better way to perform hard-negative mining when the negatives are mined from translations of an underlying image pattern, as is typically done in object detection.  The basic idea is simple, and that's why this paper wins the "thumbs-up from tombone" award. The crux of the derivation in the paper is the observation that the Gram matrix of a set of images and their translated versions, as modeled by cyclic shifts, exhibits a block-circulant structure.  Instead of incrementally mining negatives, in this paper they show that it is possible to learn directly from a training set comprising all image subwindows of a predetermined aspect-ratio and show this is feasible for a rich set of popular models including Ridge Regression, Support Vector Regression (SVR) and Logistic Regression.  Move over hard-negative mining, Joseph Fourier just rocked your world.
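A quick numpy sanity check of that observation, on a 1-D stand-in for an image pattern: the Gram matrix of all cyclic shifts is circulant, and circulant matrices are diagonalized by the DFT. This is only an illustration of the structure the paper exploits, not its training algorithm.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)                      # a 1-D "image" pattern

# Rows are all cyclic shifts (translations with wrap-around) of x.
X = np.stack([np.roll(x, k) for k in range(len(x))])
G = X @ X.T                                     # Gram matrix of the shifted copies

# Circulant structure: each row is a cyclic shift of the previous one.
print(np.allclose(G[1], np.roll(G[0], 1)))      # True

# Circulant matrices are diagonalized by the DFT: F G F^{-1} is diagonal.
F = np.fft.fft(np.eye(len(x)))
D = F @ G @ np.linalg.inv(F)
print(np.allclose(D, np.diag(np.diag(D))))      # True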

P.S. Joao Carreira also created the CPMC image segmentation algorithm at CVPR 2010.  A recent blog post from Piotr Dollár (December 10th, 2013), "A Seismic Shift in Object Detection" discusses how segmentation is coming back into vision in a big way.


3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding, Scott Satkin and Martial Hebert. ICCV 2013.

A new way of matching images that come equipped with 3D data.  Whether the data comes from Google Sketchup, or is the output of a Kinect-like scanner, more and more visual data comes with its own 3D interpretation.  Unfortunately, most state-of-the-art image matching methods rely on comparing purely visual cues.  This paper is based on an idea called "fine-grained geometry refinement" and allows the transfer of information across extreme viewpoint changes.  While still computationally expensive, it allows non-parametric (i.e., data-driven) approaches to get away with using significantly smaller amounts of data.


Modifying the Memorability of Face Photographs. Aditya Khosla, Wilma A. Bainbridge, Antonio Torralba and Aude Oliva, ICCV 2013.

Ever wanted to look more memorable in your photos?  Maybe your ad-campaign could benefit from better face pictures which are more likely to stick in people's minds.  Well, now there's an algorithm for that.  Another great MIT paper, in which the authors show that the memorability of photographs can not only be measured, but automatically enhanced!


SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels. J. Xiao, A. Owens and A. Torralba. ICCV 2013. sun3d.cs.princeton.edu



Xiao et al. continue their hard-core data collection efforts.  Now in 3D.  In addition to collecting a vast dataset of 3D reconstructed scenes, they show that there are some kinds of errors that simply cannot be overcome with high-quality solvers.  Some problems are too big and too ambitious (e.g., walking around an entire house with a Kinect) for even the best industrial-grade solvers (Google's Ceres solver) to tackle.  In this paper, they show that a small amount of human annotation is all it takes to snap those reconstructions into place.  And not any sort of crazy, click-here, click-there interfaces.  Simple LabelMe-like annotation interfaces, which require annotating object polygons, can be used to create additional object-object constraints which help the solvers do their magic.  For anybody interested in long-range scene reconstruction, take a look at their paper.

If there's one person I've ever seen that collects data while the rest of the world sleeps, it is definitely Prof. Xiao.  Congratulations on the new faculty position!  Princeton has been starving for a person like you.  If anybody is looking for PhD/Masters/postdoc positions, and wants to work alongside one of the most ambitious and driven upcoming researchers in vision (Prof. Xiao), take a look at his disclaimer/call for students/postdocs at Princeton, then apply to the program directly.  Did I mention that you probably have to be a hacker/scientist badass to land a position in his lab?

Other noteworthy papers:

Mining Multiple Queries for Image Retrieval: On-the-fly learning of an Object-specific Mid-level Representation. B. Fernando, T. Tuytelaars,  ICCV 2013.

Training Deformable Part Models with Decorrelated Features. R. Girshick, J. Malik, ICCV 2013.

Sorry if I missed your paper, there were just too many good ones to list.  For those of you still in Sydney, be sure to either take a picture of a Kangaroo, or eat one.

Wednesday, July 03, 2013

[CVPR 2013] Three Trending Computer Vision Research Areas

As I walked through the large poster-filled hall at CVPR 2013, I asked myself, "Quo vadis Computer Vision?" (Where are you going, computer vision?)  I see lots of papers which exploit last year's ideas, copious amounts of incremental research, and an overabundance of off-the-shelf computational techniques being recombined in seemingly novel ways.  When you are active in computer vision research for several years, it is not rare to find oneself becoming bored by a significant fraction of papers at research conferences.  Right after the main CVPR conference, I felt mentally drained and needed to get a breath of fresh air, so I spent several days checking out the sights in Oregon.  Here is one picture -- proof that CVPR 2013 had more to offer than ideas!



When I returned from sight-seeing, I took a more circumspect look at the field of computer vision.  I immediately noticed that vision research is actually advancing and growing in a healthy way.  (Unfortunately, most junior students have a hard time determining which research papers are actually novel and/or significant.)  A handful of new research themes arise each year, and today I'd like to briefly discuss three new computer vision research themes which are likely to rise in popularity in the foreseeable future (2-5 years).

1) RGB-D input data is trending.  

Many of this year's papers take a single 2.5D RGB-D image as input and try to parse the image into its constituent objects.  The number of papers doing this with RGB-D data is seemingly infinite.  Some other CVPR 2013 approaches don't try to parse the image, but instead do something else like: fit cuboids, reason about affordances in 3D, or reason about illumination.  The reason why such inputs are becoming more popular is simple: RGB-D images can be obtained via cheap and readily available sensors such as Microsoft's Kinect.  Depth measurements used to be obtained by expensive time-of-flight sensors (in the late 90s and early 00s), but as of 2013, $150 can buy you one of these depth-sensing bad-boys!  In fact, I had bought a Kinect just because I thought that it might come in handy one day -- and since I've joined MIT, I've been delving into the RGB-D reconstruction domain on my own.  It is just a matter of time until the newest iPhone has an on-board depth sensor, so the current line of research which relies on RGB-D input is likely to become the norm within a few years.

2) Mid-level patch discovery is a hot research topic.
Saurabh Singh from CMU introduced this idea in his seminal ECCV 2012 paper, and Carl Doersch applied this idea to large-scale Google Street-View imagery in the “What makes Paris look like Paris?” SIGGRAPH 2012 paper.  The idea is to automatically extract mid-level patches (which could be objects, object parts, or just chunks of stuff) from images with the constraint that those are the most informative patches.  Regarding the SIGGRAPH paper, see the video below.






Unsupervised Discovery of Mid-Level Discriminative Patches Saurabh Singh, Abhinav Gupta, Alexei A. Efros. In ECCV, 2012.








Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei A. Efros. What Makes Paris Look like Paris? In SIGGRAPH 2012. [pdf]

At CVPR 2013, it was evident that the idea of "learning mid-level parts for scenes" is being pursued by other top-tier computer vision research groups.  Here are some CVPR 2013 papers which capitalize on this idea:

Blocks that Shout: Distinctive Parts for Scene Classification. Mayank Juneja, Andrea Vedaldi, CV Jawahar, Andrew Zisserman. In CVPR, 2013. [pdf]

Representing Videos using Mid-level Discriminative Patches. Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry Davis. CVPR, 2013. [pdf]

Part Discovery from Partial Correspondence. Subhransu Maji, Gregory Shakhnarovich. In CVPR, 2013. [pdf]

3) Deep-learning and feature learning are on the rise within the Computer Vision community.
It seems that everybody at Google Research is working on Deep-learning.  Will it solve all vision problems?  Is it the one computational ring to rule them all?  Personally, I doubt it, but the rising presence of deep learning is forcing every researcher to brush up on their l33t backprop skillz.  In other words, if you don't know who Geoff Hinton is, then you are in trouble.

Wednesday, June 26, 2013

[Awesome@CVPR2013] Scene-SIRFs, Sketch Tokens, Detecting 100,000 object classes, and more

I promised to blog about some more exciting papers at CVPR 2013, so here is a short list of a few papers which stood out.  This list also includes this year's award-winning paper: Fast, Accurate Detection of 100,000 Object Classes on a Single Machine.  Congrats Google Research on the excellent paper!



This paper uses ideas from Abhinav Gupta's work on 3D scene understanding as well as Ali Farhadi's work on visual phrases; however, it also uses RGB-D input data (like many other CVPR 2013 papers).

W. Choi, Y. -W. Chao, C. Pantofaru, S. Savarese. "Understanding Indoor Scenes Using 3D Geometric Phrases" in CVPR, 2013. [pdf]

This paper uses the crowd to learn which parts of birds are useful for fine-grained categorization.  If you work on fine-grained categorization or run experiments with MTurk, then you gotta check this out!
Fine-Grained Crowdsourcing for Fine-Grained Recognition. Jia Deng, Jonathan Krause, Li Fei-Fei. CVPR, 2013. [ pdf ]

This paper won the best paper award.  Congrats Google Research!

Fast, Accurate Detection of 100,000 Object Classes on a Single Machine. Thomas Dean, Mark Ruzon, Mark Segal, Jon Shlens, Sudheendra Vijayanarasimhan, Jay Yagnik. CVPR, 2013 [pdf]


The following is the Scene-SIRFs paper, which I thought was one of the best papers at this year's CVPR.  The idea is to decompose an input image into intrinsic images using Barron's algorithm, which was initially shown to work on objects but is now being applied to realistic scenes.

Intrinsic Scene Properties from a Single RGB-D Image. Jonathan T. Barron, Jitendra Malik. CVPR, 2013 [pdf]


This is a graph-based localization paper which uses a sort of "Visual Memex" to solve the problem.
Graph-Based Discriminative Learning for Location Recognition. Song Cao, Noah Snavely. CVPR, 2013. [pdf]


This paper provides an exciting new way of localizing contours in images which is orders of magnitude faster than the gPb.  There is code available, so the impact is likely to be high.

Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection. Joseph J. Lim, C. Lawrence Zitnick, and Piotr Dollar. CVPR 2013. [ pdf ] [code@github]

Friday, June 21, 2013

[Awesome@CVPR2013] Image Parsing with Regions and Per-Exemplar Detectors

I've been making an inventory of all the awesome papers at this year's CVPR 2013 conference, and one which clearly stood out was Tighe & Lazebnik's paper on image parsing with regions and per-exemplar detectors.


This paper combines ideas from segmentation-based "scene parsing" (see the below video for the output of their older ECCV2010 SuperParsing system) as well as per-exemplar detectors (see my Exemplar-SVM paper, as well as my older Recognition by Association paper).  I have worked and published in these two separate lines of research, so when I tell you that this paper is worthy of reading, you should at least take a look.  Below I outline the two ideas which are being synthesized in this paper, but for all details you should read their paper (PDF link).  See the overview figure below:


Idea #1: "Segmentation-driven" Image Parsing
The idea of using bottom-up segmentation to parse scenes is not new.  Superpixels (very small segments which are likely to contain a single object category) coupled with some machine learning can be used to produce a coherent scene parsing system; however, the boundaries of objects are not as precise as one would expect.  This shortcoming stems from the smoothing terms used in random field inference, and from the fact that generic category-level classifiers have a hard time reasoning about the extent of an object.  To see how superpixel-based scene parsing works, check out the video from their older paper from ECCV 2010:


Idea #2: Per-exemplar segmentation mask transfer
For me, the most exciting thing about this paper is the integration of the segmentation mask transfer from exemplar-based detections.  The idea is quite simple: each detector is exemplar-specific and is thus equipped with its own (precise) segmentation mask.  When you produce detections from such exemplar-based systems, you can immediately transfer segmentations in a purely top-down manner.  This is what I have been trying to get people excited about for years!  Congratulations to Joseph Tighe for incorporating these ideas into a full-blown image interpretation system.  To see an example of mask transfer, check out the figure below.


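In its simplest form, the mask-transfer step could look like the hypothetical sketch below: each exemplar detector carries the binary segmentation mask of its training exemplar, and a detection pastes a resized copy of that mask into the label map at the detected box. This is my own toy illustration of the idea, not the authors' code, which also reasons jointly with the region-based terms.

import numpy as np

def transfer_mask(label_map, exemplar_mask, box, label):
    # Paste the firing exemplar's binary mask into the per-pixel label map
    # at the detected window (x0, y0, x1, y1), top-down.
    x0, y0, x1, y1 = box
    bh, bw = y1 - y0, x1 - x0
    # Nearest-neighbor resize of the exemplar mask to the box size.
    ys = np.arange(bh) * exemplar_mask.shape[0] // bh
    xs = np.arange(bw) * exemplar_mask.shape[1] // bw
    resized = exemplar_mask[np.ix_(ys, xs)]
    region = label_map[y0:y1, x0:x1]
    region[resized > 0] = label
    return label_map

# Toy usage: a 10x10 image, a 4x4 exemplar mask, one detection of class 3.
label_map = np.zeros((10, 10), dtype=int)
print(transfer_mask(label_map, np.ones((4, 4), dtype=int), (2, 2, 8, 7), 3))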
Their system produces a per-pixel labeling of the input image, and as you can see below, the results are quite good.  Here are some more outputs of their system as compared to solely region-based as well as solely detector-based systems.  Using per-exemplar detectors clearly complements superpixel-based "segmentation-driven" approaches.



This paper will be presented as an oral in the Orals 3C session called "Context and Scenes" to be held on Thursday, June 27th at CVPR 2013 in Portland, Oregon.