Wednesday, April 18, 2012

One Part Basis to Rule them All: Steerable Part Models

Last week, some of us vision hackers at MIT started an Object Recognition Reading Group.  The group is currently in stealth-mode, but our goal is to analyze, criticize, and re-synthesize ideas from the object detection/recognition community.  To inaugurate the group, I covered Hamed Pirsiavash's Steerable Part Models paper from the upcoming CVPR 2012 conference.  As background reading, I had to go over the mathematical basics of learning with tensors (i.e., multidimensional arrays), which were outlined in their earlier NIPS 2009 paper, Bilinear Classifiers for Visual Recognition.  After reading up on their work, I have a better grasp of what the trace operator actually does.  In the form tr(A^H B), it is nothing more than a Hermitian inner product on the space of linear operators from C^N to C^M (see post here for geometric interpretations of the trace).
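To make the trace remark concrete, here is a minimal sketch in plain Python (no libraries) that checks tr(A^H B) against the element-wise sum of conj(A_ij) * B_ij.  The matrices A and B and all helper names are made up for illustration; nothing here comes from the paper.

```python
def conj_transpose(A):
    """Return the conjugate transpose A^H of a matrix stored as a list of rows."""
    return [[A[i][j].conjugate() for i in range(len(A))] for j in range(len(A[0]))]

def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def trace(A):
    """Sum of the diagonal entries of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))

def trace_inner(A, B):
    """Inner product <A, B> = tr(A^H B)."""
    return trace(matmul(conj_transpose(A), B))

def elementwise_inner(A, B):
    """The same inner product written as sum_ij conj(A_ij) * B_ij."""
    return sum(a.conjugate() * b for row_a, row_b in zip(A, B)
               for a, b in zip(row_a, row_b))

# Two illustrative 2x3 complex matrices, i.e. linear operators from C^3 to C^2.
A = [[1 + 2j, 0j, 3j], [2 + 0j, 1 - 1j, 0j]]
B = [[0j, 1j, 1 + 0j], [1 + 1j, 2 + 0j, 0j]]

print(trace_inner(A, B), elementwise_inner(A, B))
```

Both expressions give the same number, swapping the arguments conjugates the result, and trace_inner(A, A) is the squared Frobenius norm of A.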

Hamed Pirsiavash, Deva Ramanan, "Steerable part models", CVPR 2012

"Our representation can be seen as an approach to sharing parts." 
-- H. Pirsiavash and D. Ramanan

The idea behind this paper is relatively simple -- instead of learning category-specific part-models, learn a part-basis from which all category-specific part models are derived.  Consider the different parts learned from a deformable part model (see Felzenszwalb's DPM page for more info about DPMs) and their depiction below.  If you take a close look, you will see that the parts are quite general, and it makes sense to assume that there is a finite basis from which these parts are drawn.

Parts from a Part-model

The model learns a steerable basis by factoring the matrix of all part models into the product of two low rank matrices.  Because the basis is shared across categories, this performs both dimensionality reduction (likely to help prevent over-fitting as well as speed up the final detectors) and sharing (likely to boost performance).
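As a rough sketch of what the factorization buys you, the snippet below rebuilds part filters as linear combinations of a shared basis and counts parameters before and after factoring.  The function names and the sizes (d, n, k) are hypothetical and chosen only for illustration; they are not the paper's settings.

```python
def reconstruct_parts(basis, coeffs):
    """Rebuild part filters from a shared basis.

    basis:  k rows, each a flattened basis filter of length d
    coeffs: n rows, each holding the k mixing coefficients of one part
    Returns n rows of length d (one reconstructed filter per part).
    """
    k, d = len(basis), len(basis[0])
    return [[sum(c[j] * basis[j][t] for j in range(k)) for t in range(d)]
            for c in coeffs]

def parameter_counts(d, n, k):
    """Return (full, factored) parameter counts for n filters of dimension d
    represented with a rank-k shared basis."""
    full = d * n              # every part stores its own d-dimensional filter
    factored = k * d + n * k  # k shared basis filters plus k coefficients per part
    return full, factored

# Tiny worked example: 2 basis filters of dimension 3, 2 parts.
basis = [[1, 0, 2], [0, 1, 1]]
coeffs = [[1, 0], [2, 3]]
parts = reconstruct_parts(basis, coeffs)

# Hypothetical sizes: 1152-dimensional filters, 480 parts, 20 basis filters.
full, factored = parameter_counts(d=1152, n=480, k=20)
print(parts, full, factored)
```

With these made-up sizes the factored model stores far fewer numbers than the full one, which is the dimensionality-reduction half of the argument; the sharing half comes from every category drawing on the same basis rows.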

The learned steerable basis

While the objective function is not convex, it can be tackled via a simple alternating optimization algorithm where the resulting sub-objectives are convex and can be optimized using off-the-shelf Linear SVM solvers.  They call this property bi-convexity: the objective is convex in each factor when the other is held fixed.  This doesn't guarantee finding the global optimum; it just makes it easy to use standard tools.
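The alternating scheme is easy to see on a toy problem.  The sketch below is not the paper's SVM objective: it swaps in a rank-1 least-squares fit, M ≈ u v^T, which is bi-convex in the same sense (fix v and the problem in u is convex, and vice versa).  All names are illustrative, and it assumes a non-degenerate starting point so that no division by zero occurs.

```python
def loss(M, u, v):
    """Squared reconstruction error sum_ij (M_ij - u_i * v_j)^2."""
    return sum((M[i][j] - u[i] * v[j]) ** 2
               for i in range(len(u)) for j in range(len(v)))

def update_u(M, v):
    """With v fixed, the least-squares problem in u is convex; closed-form solution."""
    vv = sum(x * x for x in v)
    return [sum(M[i][j] * v[j] for j in range(len(v))) / vv for i in range(len(M))]

def update_v(M, u):
    """With u fixed, the least-squares problem in v is convex; closed-form solution."""
    uu = sum(x * x for x in u)
    return [sum(M[i][j] * u[i] for i in range(len(u))) / uu for j in range(len(M[0]))]

def alternate(M, v0, n_iters=20):
    """Alternate the two convex sub-problems and record the loss after each sweep."""
    v = list(v0)
    history = []
    for _ in range(n_iters):
        u = update_u(M, v)
        v = update_v(M, u)
        history.append(loss(M, u, v))
    return u, v, history

# An exactly rank-1 matrix (outer product of [1, 2, 3] and [4, 5]) ...
M_rank1 = [[4.0, 5.0], [8.0, 10.0], [12.0, 15.0]]
u1, v1, hist1 = alternate(M_rank1, v0=[1.0, 1.0])

# ... and a matrix that is not rank 1, so some error must remain.
M_general = [[2.0, 1.0], [1.0, 3.0], [0.0, 1.0]]
u2, v2, hist2 = alternate(M_general, v0=[1.0, 1.0])
```

Each sweep can only lower (or keep) the loss, since each half-step solves its sub-problem exactly, but where it ends up can depend on the starting point, which is the "no global optimum guarantee" caveat in miniature.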

While the results on PASCAL VOC2007 do not show an improvement in performance, they show a significant computational speed up.  VOC2007 is not a very good dataset for sharing, as there are only a few category combinations (e.g., bicycle and motorbike) which should in theory benefit significantly from it.  Below is a picture of the part-based car model from Felzenszwalb et al., as well as the one from their steerable basis approach.  Note that the HOG visualizations look very similar.

In conclusion, this is one paper worthy of checking out if you are serious about object recognition research.  The simplicity of the approach is a strong point, and if you are a HOG-hacker (like many of us these days) then you will be able to understand the paper without a problem.

Tuesday, April 17, 2012

Using Panoramas for Better Scene Understanding

There's a lot more to automated object interpretation than merely predicting the correct category label.  If we want machines to be able to one day interact with objects in the physical world, then predicting additional properties of objects such as their attributes, segmentations, and poses is of utmost importance.  This has been one of the key motivations in my own research behind exemplar-based models of object recognition.

The same argument holds for scenes.  If we want to build machines which understand environments around them, then they will have to do much more than predict some sloppy "scene category."  Consider what happens when a machine automatically analyzes a picture and says that it is from the "theatre" category.  Well, the picture could be of the stage, the emergency exit, or just about anything else within a theatre -- in each of these cases, the "theatre" category would be deemed correct, but would fall short of explaining the content of the image.  Most scene understanding papers either focus on getting the scene category right, or strive to obtain a pixel-wise semantic segmentation map.  However, there's more to scene categories than meets the eye.

Well, there is an interesting paper which will be presented this summer at the CVPR2012 Conference in Rhode Island which tries to bring the concept of "pose" into scene understanding.  Pose-estimation has already been well established in the object recognition literature, but this is one of the first serious attempts to bring this new way of thinking into scene understanding.

J. Xiao, K. A. Ehinger, A. Oliva and A. Torralba.
Recognizing Scene Viewpoint using Panoramic Place Representation.
Proceedings of 25th IEEE Conference on Computer Vision and Pattern Recognition, 2012.

The SUN360 panorama project page also has links to code, etc.

The basic representation unit of places in their paper is that of a panorama.  If you've ever taken a vision course, then you probably stitched some of your own.  Below are some examples of cool looking panoramas from their online gallery.  A panorama roughly covers the space of all images you could take while centered within a place.

Car interior panoramas from SUN360 page
Building interior panoramas from SUN360 page

What the proposed algorithm accomplishes is twofold.  First, it acts like an ordinary scene categorization system and produces a meaningful semantic label.  Second, it predicts the likely view within a place.  This is very much like predicting that there is a car in an image, and then providing an estimate of the car's orientation.  Below are some pictures of inputs (left column), a compass-like visualization which shows the orientation of the picture (with respect to a cylindrical panorama), as well as a depiction of the image content likely to fall outside of the image boundary.  The middle column shows per-place mean panoramas (in the style of TorralbaArt), as well as the input image aligned with the mean panorama.
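To get a feel for what "orientation with respect to a cylindrical panorama" means, here is a small sketch that maps a viewing direction and horizontal field of view to the panorama columns that view would cover, wrapping around the seam.  This is only an illustration of the representation, not the paper's algorithm; the function name, its parameters, and the assumption that columns are spaced uniformly in angle over 360 degrees are all mine.

```python
def view_columns(heading_deg, fov_deg, pano_width):
    """Return the panorama pixel columns covered by a view.

    heading_deg: viewing direction at the image center, in degrees
    fov_deg:     horizontal field of view of the image, in degrees
    pano_width:  width in pixels of a full 360-degree cylindrical panorama
    Columns wrap around modulo pano_width.
    """
    cols_per_deg = pano_width / 360.0
    n_cols = int(round(fov_deg * cols_per_deg))
    first = int(round((heading_deg - fov_deg / 2.0) * cols_per_deg)) % pano_width
    return [(first + c) % pano_width for c in range(n_cols)]

# A 90-degree view looking at heading 0 straddles the panorama seam.
front = view_columns(heading_deg=0, fov_deg=90, pano_width=360)

# A 60-degree view looking at heading 180 sits in the middle of a 720-pixel panorama.
back = view_columns(heading_deg=180, fov_deg=60, pano_width=720)
```

Everything in the panorama outside the returned columns is the "content outside the image boundary" that the visualization extrapolates.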

I think panoramas are a very natural representation for places, perhaps not as rich as a full 3D reconstruction of places, but definitely much richer than static photos.  If we want to build better image understanding systems, then we should seriously start looking at using richer sources of information as compared to static images.  There is only so much you can do with static images and MTurk; thus videos, 3D models, panoramas, etc. are likely to be big players in the upcoming years.