Thursday, June 21, 2012

Predicting events in videos before they happen: CVPR 2012 Best Student Paper

Intelligence is all about making inferences given observations, but somewhere in the history of Computer Vision, we (as a community) have put too much emphasis on classification tasks.  What many researchers in the field (unfortunately this includes me) focus on is extracting semantic meaning from images, image collections, and videos.  Whether the output is a scene category label, an object identity and location, or an action category, the way we proceed is relatively straightforward:
  • Extract some measurements from the image (we call them "features"; SIFT and HOG are two very popular examples)
  • Feed those features into a machine learning algorithm which predicts the category the features belong to.  Some popular choices of algorithms are Neural Networks, SVMs, decision trees, and boosted decision stumps (see the sketch after this list)
  • Evaluate our features on a standard dataset (such as Caltech-256, PASCAL VOC, ImageNet, LabelMe, etc)
  • Publish (or, as it is commonly known in academic circles: publish-or-perish)
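To make the pipeline concrete, here is a minimal toy sketch in Python.  The specific choices (scikit-image's HOG features, scikit-learn's linear SVM) are stand-ins for illustration, not any particular paper's recipe:

```python
# A toy version of the generic recognition pipeline described above.
# HOG + linear SVM are stand-in choices, not any specific paper's setup.
import numpy as np
from skimage.feature import hog          # step 1: feature extraction
from sklearn.svm import LinearSVC        # step 2: learning algorithm

def extract_features(images):
    """images: list of equal-sized 2-D grayscale arrays."""
    return np.array([hog(img, pixels_per_cell=(8, 8)) for img in images])

def train_classifier(train_images, train_labels):
    clf = LinearSVC()
    clf.fit(extract_features(train_images), train_labels)
    return clf

# Step 3: evaluate on a held-out standard dataset.
# accuracy = clf.score(extract_features(test_images), test_labels)
# Step 4 (publish) is left as an exercise for the reader.
```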
Action recognition has only become popular in the last five years, yet it still adheres to this generic machine vision pipeline.  But let's consider a scenario where adhering to this template can have disastrous consequences.  Let's ask ourselves the following question:

Q: Why did the robot cross the road?
Image courtesy of napkinville.com

A: The robot didn't cross the road -- he was obliterated by a car.  This is because in order to make decisions in the world, you can't just wait until all the observations have come in.  To build a robot that can cross the road, you need to be able to predict things before they happen! (Alternate answer: The robot died because he wasn't using Minh's early-event detection framework, the topic of today's blog post.)

This year's Best Student Paper winner at CVPR has given us a flavor of something more, something beyond the traditional action recognition pipeline: "early event detection."  Simply put, the goal is to detect an action before it completes.  Minh's research is rather exciting and opens the door to a new paradigm in recognition.  If we want intelligent machines roaming the world around us (and every CMU Robotics PhD student knows that this is really what vision is all about), then recognition after an action has happened will not enable our robots to do much beyond passive observation.  Prediction (and not classification) is the killer app of computer vision, because classification assumes the data has already been given to you, while prediction assumes an intent to act on and interpret the future.


While Minh's work focused on comparatively constrained tasks such as facial expression recognition, hand gesture recognition, and human activity recognition, I believe these ideas will help make machines more intelligent and more suitable for performing actions in the real world.

 Disgust detection example from CVPR 2012 paper
 


To give the vision hackers a few more details, this framework uses Structured Output SVMs (NOTE: a trending topic at CVPR) and is able to detect an action with high confidence before it actually finishes.  This is something we humans seem to do all the time, but it has somehow been neglected by machine vision researchers.
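To give a flavor of what that looks like at test time, here is my own toy reconstruction (not Minh's code): score every partial segment ending at the current frame with a learned linear function, and fire as soon as the score clears a threshold.  In the actual paper, the weights come from a structured-output SVM trained so that partially observed events already score well.

```python
# A simplified sketch of the online detection loop for early event detection.
# The learned (w, b) and the mean-pooled segment descriptor are assumptions
# for illustration; the paper's training formulation is more sophisticated.
import numpy as np

def online_detect(frame_features, w, b, threshold=0.0, max_len=100):
    """frame_features: (T, d) per-frame features; w, b: learned detector."""
    for t in range(len(frame_features)):
        # Consider candidate segments [s, t] ending at the current time t.
        for s in range(max(0, t - max_len), t + 1):
            seg = frame_features[s:t + 1].mean(axis=0)  # simple segment descriptor
            if w @ seg + b > threshold:
                return s, t  # event detected before it has finished
    return None  # no event detected yet
```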


Minh Hoai, Fernando De la Torre
Max-Margin Early Event Detectors
CVPR 2012

Abstract:
The need for early detection of temporal events from sequential data arises in a wide spectrum of applications ranging from human-robot interaction to video security. While temporal event detection has been extensively studied, early detection is a relatively unexplored problem. This paper proposes a maximum-margin framework for training temporal event detectors to recognize partial events, enabling early detection. Our method is based on Structured Output SVM, but extends it to accommodate sequential data. Experiments on datasets of varying complexity, for detecting facial expressions, hand gestures, and human activities, demonstrate the benefits of our approach. To the best of our knowledge, this is the first paper in the literature of computer vision that proposes a learning formulation for early event detection.

Early Event Detector Project Page (code available on website)

Minh gave an excellent, enthusiastic, and entertaining presentation during day 3 of CVPR 2012, and it was definitely one of the highlights of that day. He received his PhD from CMU's Robotics Institute (like me, yippee!) and is currently a Postdoctoral research scholar in Andrew Zisserman's group at Oxford.  Let's all congratulate Minh on all his hard work.


CVPR 2012 Day 2: optimize, optimize, optimize

By popular demand, here is my overview of some of the coolest stuff from Day 2 of CVPR 2012 in Providence, RI.  While the lobster dinner was the highlight for many of us, there were also some serious learning/optimization-based papers presented during Day 2 worthy of sharing.  Here are some of the papers which left me with a very positive impression.


Dennis Strelow of Google Research in Mountain View presented a general framework for Wiberg minimization.  This is a strategy for minimizing objective functions of multiple variables -- objectives which are typically tackled in an EM-style, alternating fashion.  The idea is to express one of the variables in closed form as a function of the other, effectively making the problem depend on only one set of variables.  The technique is quite general and has been shown to produce state-of-the-art results on a bundle adjustment problem.  I know Dennis from my second internship at Google, where we worked on some sparse-coding problems.  If you work on lots of matrix decomposition problems, check out his paper!
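Here is a toy illustration of the elimination idea (my own sketch of the general principle, not Dennis's algorithm, which uses Gauss-Newton machinery rather than a black-box optimizer): in low-rank matrix factorization, the inner variable V has a closed-form least-squares solution given U, so the objective can be reduced to a function of U alone.

```python
# Variable elimination for rank-r factorization  min_{U,V} ||Y - U V||_F^2.
# Solving for V in closed form reduces the problem to a function of U only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m, n, r = 20, 15, 2
Y = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # exactly rank-r data

def reduced_objective(u_flat):
    U = u_flat.reshape(m, r)
    # Eliminate V: the inner least-squares problem has a closed-form solution.
    V, *_ = np.linalg.lstsq(U, Y, rcond=None)
    return np.linalg.norm(Y - U @ V) ** 2

res = minimize(reduced_objective, rng.standard_normal(m * r), method="L-BFGS-B")
print("residual after eliminating V:", res.fun)  # ~0 for exactly rank-r Y
```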


Dennis Strelow
General and Nested Wiberg Minimization
CVPR 2012


Another cool paper which is all about learning is Hossein Mobahi's algorithm for optimizing objectives by smoothing them to avoid getting stuck in local minima.  This paper is not about blurry images, but about applying Gaussians to objective functions.  In fact, for the problem of image alignment, Hossein provides closed-form smoothed versions of the image operators.  When you apply these operators to images, you efficiently smooth the underlying cross-correlation alignment objective.  You then decrease the blur while following the optimum path, and get much nicer answers than doing naive image alignment.
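Here is a toy sketch of the Gaussian-smoothing (continuation) idea in one dimension, purely my own illustration: the paper derives closed-form smoothed operators for image alignment rather than blurring a sampled objective, but the coarse-to-fine tracking is the same in spirit.

```python
# Blur a non-convex objective heavily, find its minimum, then track that
# minimum as the blur decreases toward zero (graduated non-convexity).
import numpy as np
from scipy.ndimage import gaussian_filter1d

x = np.linspace(-5, 5, 2001)
f = 0.1 * x**2 + np.sin(3 * x)           # non-convex, many local minima

x_star = None
for sigma in [300, 100, 30, 10, 0]:       # blur widths in samples, coarse to fine
    f_smooth = gaussian_filter1d(f, sigma) if sigma > 0 else f
    if x_star is None:
        i = int(np.argmin(f_smooth))      # global min of the heavily smoothed objective
    else:
        # Follow the optimum path: search only near the previous solution.
        idx = np.where(np.abs(x - x_star) < 1.0)[0]
        i = int(idx[np.argmin(f_smooth[idx])])
    x_star = x[i]
print("tracked minimum:", x_star)
```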


Hossein Mobahi, C. Lawrence Zitnick, Yi Ma
Seeing through the Blur
CVPR 2012


Ira Kemelmacher-Shlizerman, of Photobios fame, showed a really cool algorithm for computing optical flow between two different faces based on learning a subspace (using a large database of faces).  The idea is quite simple and allows for flowing between two very different faces, where the underlying operation produces a sequence of intermediate faces in an interpolation-like manner.  She shared this video with us during her presentation, but it is on YouTube, so now you can enjoy it for yourself.
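Here is my rough sketch of the core idea (my own reconstruction, not the authors' code, and glossing over their iterative refinement): project each face onto a low-rank subspace learned from the whole collection, so the two projections share appearance and lighting, then run standard optical flow between the projections.

```python
# Toy version of the subspace-projection idea, using OpenCV's Farneback
# flow as a stand-in for whatever flow algorithm one prefers.
import numpy as np
import cv2

def collection_flow(faces, i, j, rank=10):
    """faces: (N, H, W) uint8 array of roughly aligned grayscale faces."""
    N, H, W = faces.shape
    X = faces.reshape(N, -1).astype(np.float64)
    mean = X.mean(axis=0)
    # Project the whole collection onto its top principal components.
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    low = (U[:, :rank] * S[:rank]) @ Vt[:rank] + mean
    a = np.clip(low[i], 0, 255).astype(np.uint8).reshape(H, W)
    b = np.clip(low[j], 0, 255).astype(np.uint8).reshape(H, W)
    # Flow between the low-rank projections is far better behaved than flow
    # between raw faces that differ in identity and lighting.
    return cv2.calcOpticalFlowFarneback(a, b, None, 0.5, 3, 15, 3, 5, 1.2, 0)
```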


Ira Kemelmacher-Shlizerman, Steven M. Seitz
Collection Flow
CVPR 2012



Talk about cool ideas!  Pyry, of CMU fame, presented a recommendation engine for classifiers.  The idea is to take techniques from collaborative filtering (think Netflix!) and apply them to the classifier selection problem.  Pyry has been working on action recognition, and the ideas presented in this work are not only quite general but also quite intuitive, and likely to benefit anybody working with large collections of classifiers.
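To convey the flavor (this is my toy reconstruction, not Pyry's actual formulation), treat a classifiers-by-tasks matrix of observed performance scores like a Netflix ratings matrix: fit a low-rank factorization to the observed entries and use it to predict how unseen classifier/task pairs would perform.

```python
# Collaborative filtering over classifier performance via low-rank
# matrix factorization with gradient descent on the observed entries.
import numpy as np

def recommend(R, mask, rank=5, steps=2000, lr=0.01, reg=0.1):
    """R: (n_classifiers, n_tasks) scores; mask: True where R was measured."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((R.shape[0], rank))
    Q = 0.1 * rng.standard_normal((R.shape[1], rank))
    for _ in range(steps):
        E = mask * (R - P @ Q.T)           # error on observed entries only
        P += lr * (E @ Q - reg * P)        # gradient steps on the squared loss
        Q += lr * (E.T @ P - reg * Q)
    return P @ Q.T                         # predicted score for every pair

# For a new task with a few probe evaluations, recommend the classifier
# with the highest predicted score among the ones you never actually ran.
```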

Pyry Matikainen, Rahul Sukthankar, Martial Hebert
Model Recommendation for Action Recognition
CVPR 2012


And finally, a super-easy metric-learning algorithm presented by Martin Köstinger had me intrigued!  This is a Mahalanobis distance metric learning paper which uses equivalence relationships.  This means that you are given pairs of similar items and pairs of dissimilar items.  The underlying algorithm is really not much more than fitting two covariance matrices, one to the positive equivalence relations, and another to the non-equivalence relations.  They have lots of code online, and if you don't believe that such a simple algorithm can beat LMNN (Large-Margin Nearest Neighbor from Kilian Weinberger), then get their code and hack away!
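The recipe is simple enough to sketch in a few lines.  This is my paraphrase of the idea described above (their actual code is on the project page): fit one covariance to the differences of similar pairs, one to the differences of dissimilar pairs, and take the difference of the inverses as the Mahalanobis matrix.

```python
# Minimal sketch of the two-covariance metric-learning recipe.
import numpy as np

def learn_metric(X_sim_a, X_sim_b, X_dis_a, X_dis_b):
    """Each argument is (n_pairs, d); rows of *_a pair with rows of *_b."""
    d_sim = X_sim_a - X_sim_b
    d_dis = X_dis_a - X_dis_b
    cov_sim = d_sim.T @ d_sim / len(d_sim)   # covariance of similar pairs
    cov_dis = d_dis.T @ d_dis / len(d_dis)   # covariance of dissimilar pairs
    M = np.linalg.inv(cov_sim) - np.linalg.inv(cov_dis)
    # Project M onto the PSD cone so it defines a valid metric.
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.clip(w, 0, None)) @ V.T

def mahalanobis_sq(M, x, y):
    diff = x - y
    return float(diff @ M @ diff)
```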

Martin Köstinger, Martin Hirzer, Paul Wohlhart, Peter M. Roth, Horst Bischof
Large Scale Metric Learning from Equivalence Constraints
CVPR 2012



CVPR 2012 gave us many very math-oriented papers, and while I cannot list all of them, I hope you found my short list useful.



Tuesday, June 19, 2012

CVPR 2012 Day 1: Accidental Cameras, Large Jigsaws, and Cosegmentation

Today ended the first day of CVPR 2012 in Providence, RI.  And here's a quick recap:
  • On the administrative end of things, Deva Ramanan received an award for his contributions to the field as a young CVPR researcher.  This is a new nomination-based award, so be sure to vote for your favorite vision scientists next year!  Deva's work has truly influenced the field: he is well known for co-authoring the Felzenszwalb et al. DPM object detector, and since then he has pushed his ideas on part-based models to the next level.  Congratulations Deva, you are the type of researcher we should all strive to be.  
  • Secondly, it looks like CVPR 2015 will be in Boston.
  • Here are some noteworthy papers from the oral sessions of Day 1:


During the first oral session, Antonio Torralba gave an intriguing talk where he showed the world how accidental anti-pinhole and pin-speck cameras are "all around us."  In his presentation, he showed how a person walking in front of a window can be used to image the world outside the window.  Additionally, he showed a variant of image-based Van Eck phreaking, where his technique could be used to view what is on a person's computer screen without having to look at the screen directly.

Antonio Torralba, William T. Freeman
Accidental pinhole and pinspeck cameras: revealing the scene outside the picture
CVPR 2012


Andrew Gallagher gave a really great presentation on using computer vision to solve jigsaw puzzles, where not only are the pieces jumbled, but their orientation is unknown.  His algorithm was used to solve really, really large puzzles, ones much larger than a human could tackle.

Andrew Gallagher
Jigsaw Puzzles with Pieces of Unknown Orientation
CVPR 2012


Gunhee Kim presented his newest work on co-segmentation.  He has been working on this for quite some time and if you are interested in segmentation in image collections, you should definitely check it out.

Gunhee Kim, Eric P. Xing
On Multiple Foreground Cosegmentation
CVPR 2012


Sunday, June 17, 2012

Workshop on Egocentric Vision @ CVPR 2012

Today (Sunday 6/17/2012) is the second day of CVPR 2012 workshops and I'll be going to the Egocentric Vision workshop.  The workshop kicks off at 8:50am (come earlier for some CVPR breakfast) and will start with a keynote talk by Takeo Kanade.  There will also be a talk by Hartmut Neven, of Neven Vision and now part of Google.  Also, during the poster session, my fellow colleague Abhinav Shrivastava will be presenting his work on applying ExemplarSVMs to detection from a first-person point of view -- yet another super-cool application of ExemplarSVMs.

Object detection from first person's view using exemplar SVMs

There are plenty of other cool talks during this workshop, including: action recognition from a first-person point of view, experience classification, and a study of the obtrusiveness of wearable computing platforms by some fellow MIT vision hackers.

The accuracy-obtrusiveness tradeoff for wearable vision platforms

You might be thinking, "What is egocentric vision?"  Nothing explains it better than the following video from Google about its super-exciting research project, codenamed Project Glass.  I'm really hoping Hartmut talks about this...


If you're looking for me, you know where I'll be tomorrow.  Happy computing.