
Tuesday, April 17, 2012

Using Panoramas for Better Scene Understanding

There's a lot more to automated object interpretation than merely predicting the correct category label.  If we want machines to be able to one day interact with objects in the physical world, then predicting additional properties of objects such as their attributes, segmentations, and poses is of utmost importance.  This has been one of the key motivations in my own research behind exemplar-based models of object recognition.

The same argument holds for scenes.  If we want to build machines which understand the environments around them, then they will have to do much more than predict some sloppy "scene category."  Consider what happens when a machine automatically analyzes a picture and says that it is from the "theater" category.  Well, the picture could be of the stage, the emergency exit, or just about anything else within a theater -- in each of these cases, the "theater" category would be deemed correct, but would fall short of explaining the content of the image.  Most scene understanding papers either focus on getting the scene category right, or strive to obtain a pixel-wise semantic segmentation map.  However, there's more to scene categories than meets the eye.

Well, there is an interesting paper, to be presented this summer at the CVPR 2012 conference in Rhode Island, which tries to bring the concept of "pose" into scene understanding.  Pose estimation is already well established in the object recognition literature, but this is one of the first serious attempts to bring this new way of thinking into scene understanding.

J. Xiao, K. A. Ehinger, A. Oliva and A. Torralba.
Recognizing Scene Viewpoint using Panoramic Place Representation.
Proceedings of 25th IEEE Conference on Computer Vision and Pattern Recognition, 2012.

The SUN360 panorama project page also has links to code, etc.


The basic representation unit of places in their paper is that of a panorama.  If you've ever taken a vision course, then you probably stitched some of your own.  Below are some examples of cool looking panoramas from their online gallery.  A panorama roughly covers the space of all images you could take while centered within a place.
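To make the idea of a panorama as a bundle of possible views concrete, here is a minimal numpy sketch (my own illustration, not code from the paper) that samples a pinhole-camera view from an equirectangular panorama at a given yaw angle; sweeping the yaw enumerates the photos one could take while standing at the center of the place. The function name and parameters are hypothetical.

```python
import numpy as np

def perspective_view(pano, yaw_deg, fov_deg=90.0, out_size=256):
    """Sample a pinhole-camera view from an equirectangular panorama.

    pano:    H x W x 3 array covering 360 deg of longitude, 180 deg of latitude.
    yaw_deg: camera heading; every yaw gives a different "photo" of the place.
    """
    H, W = pano.shape[:2]
    f = (out_size / 2.0) / np.tan(np.radians(fov_deg) / 2.0)  # focal length (pixels)

    # Pixel grid of the output view, centered at the principal point.
    u, v = np.meshgrid(np.arange(out_size) - out_size / 2.0,
                       np.arange(out_size) - out_size / 2.0)

    # Ray directions in camera coordinates (x right, y down, z forward).
    x, y, z = u, v, np.full_like(u, f)

    # Rotate the rays about the vertical axis by the requested yaw.
    yaw = np.radians(yaw_deg)
    xr = np.cos(yaw) * x + np.sin(yaw) * z
    zr = -np.sin(yaw) * x + np.cos(yaw) * z

    # Convert to longitude/latitude, then to panorama pixel coordinates.
    lon = np.arctan2(xr, zr)                         # [-pi, pi]
    lat = np.arctan2(y, np.sqrt(xr ** 2 + zr ** 2))  # [-pi/2, pi/2]
    px = ((lon / np.pi + 1.0) / 2.0 * (W - 1)).astype(int)
    py = ((lat / (np.pi / 2) + 1.0) / 2.0 * (H - 1)).astype(int)
    return pano[py, px]

# Sweeping the yaw enumerates the views available from the center of the place.
pano = np.zeros((512, 1024, 3))  # placeholder panorama
views = [perspective_view(pano, yaw) for yaw in range(0, 360, 30)]
```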

Car interior panoramas from SUN360 page
 Building interior panoramas from SUN360 page

The proposed algorithm accomplishes two things.  It acts like an ordinary scene categorization system, but in addition to producing a meaningful semantic label, it also predicts the likely view within a place.  This is very much like predicting that there is a car in an image, and then providing an estimate of the car's orientation.  Below are some pictures of inputs (left column), a compass-like visualization which shows the orientation of the picture (with respect to a cylindrical panorama), as well as a depiction of the likely image content falling outside of the image boundary.  The middle column shows per-place mean panoramas (in the style of TorralbaArt), as well as the input image aligned with the mean panorama.
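The actual paper trains view-specific models on panorama crops; purely as a rough illustration of the viewpoint-prediction step, here is a toy sketch (assuming the perspective_view helper from the earlier snippet and some feature extractor feat_fn, both my own stand-ins) that matches a query photo against views sampled around a panorama and reports the best-aligned compass angle.

```python
import numpy as np

def predict_viewpoint(image_feat, place_pano, feat_fn, n_views=36):
    """Toy viewpoint predictor: compare the query photo's descriptor against
    views sampled every 360/n_views degrees from a place's panorama and
    return the best-matching compass angle (in degrees) plus all scores.

    image_feat: descriptor of the query photo (e.g., a GIST-like vector).
    place_pano: equirectangular panorama representing the place category.
    feat_fn:    function mapping an image to a descriptor (assumed given).
    """
    scores = []
    for i in range(n_views):
        yaw = i * 360.0 / n_views
        view = perspective_view(place_pano, yaw)  # helper from the earlier sketch
        scores.append(-np.linalg.norm(feat_fn(view) - image_feat))
    best = int(np.argmax(scores))
    return best * 360.0 / n_views, scores
```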


I think panoramas are a very natural representation for places, perhaps not as rich as a full 3D reconstruction, but definitely much richer than static photos.  If we want to build better image understanding systems, then we should seriously start looking at richer sources of information than static images.  There is only so much you can do with static images and MTurk, so videos, 3D models, panoramas, etc. are likely to be big players in the upcoming years.

Friday, September 09, 2011

My first week at MIT: What is intelligence?

In case anybody hasn't heard the news, I am no longer a PhD student at CMU.  After I handed in my camera-ready dissertation, it didn't take long for my CMU advisor to promote me from his 'current students' to 'former students' list on his webpage.  Even though I doubt there is any place in the world which can rival CMU when it comes to computer vision, I've decided to give MIT a shot.  I had wanted to come to MIT for a long time, but 6 years ago I decided to choose CMU's RI over MIT's CSAIL for my computer vision PhD.  Life is funny because the paths we take aren't dead-ends -- I'm glad I had a second chance to come to MIT.


In case you haven't heard, MIT is a little tech school somewhere in Boston.  Lots of undergrads can be caught wearing math T-shirts, and posters like the following can be found on the walls of MIT:


A cool (undergrad targeted) poster I saw at MIT



As of last week I'm officially a postdoc in CSAIL and I'll be working with Antonio Torralba and Aude Oliva. I've been closely following both Antonio's and Aude's work over the last several years, and getting to work with these giants of vision will surely be a treat.  In case you don't know what a postdoc is, it is a generic term used to describe post-PhD researchers with generally short-term (1-3 year) appointments.  People generally use the term Postdoctoral Fellow or Postdoctoral Associate to describe their position at a university. I guess 3 years of working on vision as an undergrad and 6 years of working on vision as a grad student just weren't enough for me...


I've been getting adjusted to my daily commute through scenic Boston, learning about all the cool vision projects in the lab, as well as meeting all the PhD students working with Antonio. Today was the first day of a course which I'm sitting in on, titled "What is intelligence?".  When I saw a course offered by two computer vision titans (Shimon Ullman and Tomaso Poggio), I couldn't resist.  Here is the information:

What is intelligence?



Class Times: Friday 11:00-2:00 pm
Units: 3-0-9
Location: 46-5193 (NOTE: we had to choose a bigger room)
Instructors: Shimon Ullman and Tomaso Poggio

The class was packed -- we had to relocate to a bigger room.  Much of today's lecture was given by Lorenzo Rosasco. Lorenzo is the Team Leader of IIT@MIT. Here is a blurb from IIT@MIT's website describing what this 'center' is all about:

The IIT@MIT lab was founded from an agreement between the Massachusetts Institute of Technology (MIT) and the Istituto Italiano di Tecnologia (IIT). The scientific objective is to develop novel learning and perception technologies – algorithms for learning, especially in the visual perception domain, that are inspired by the neuroscience of sensory systems and are developed within the rapidly growing theory of computational learning. The ultimate goal of this research is to design artificial systems that mimic the remarkable ability of the primate brain to learn from experience and to interpret visual scenes.


Another cool class offered this semester at MIT is Antonio Torralba's Grounding Object Recognition and Scene Understanding.


Thursday, November 05, 2009

The Visual Memex: Visual Object Recognition Without Categories


Figure 1
I have discussed the limitations of using rigid object categories in computer vision, and my CVPR 2008 work on Recognition as Association was a move towards developing a category-free model of objects. I was primarily concerned with local object recognition, where the recognition problem was driven by the appearance/shape/texture features derived from within a segment (a region extracted from an image using a segmentation algorithm). Recognition of objects was done locally and independently per region, since I did not have a good model of category-free context at that time. I've given the problem of contextual object reasoning much thought over the past several years, and equipped with the power of graphical models and learning algorithms, I now present a model for category-free object relationship reasoning.

Now it's 2009, and it's no surprise that I have a paper on context. Context is the new beast and all the cool kids are using it for scene understanding; however, categories are used so often for this problem that their use is rarely questioned. In my NIPS 2009 paper, I present a category-free model of object relationships and address the problem of context-only recognition, where the goal is to recognize an object solely based on contextual cues. Figure 1 shows an example of such a prediction task. Given K objects and their spatial configuration, is it possible to predict the appearance of a hidden object at some spatial location?

Figure 2


I present a model called the Visual Memex (visualized in Figure 2), which is a non-parametric graph-based model of visual concepts and their interactions. Unlike traditional approaches to object-object modeling which learn potentials between every pair of categories (the number of such pairs scales quadratically with the number of categories), I make no category assumptions for context.
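To give a flavor of how a category-free, exemplar-graph model can score a hidden object location, here is a hedged sketch in Python. The data structures, the relative-arrangement encoding, and the kernel-style compatibility score are my own simplifications for illustration; they are not the paper's actual potentials or features.

```python
import numpy as np

def spatial_feature(box_a, box_b):
    """Relative 2D arrangement of two object boxes (dx, dy, log scale ratio).
    Boxes are (x, y, w, h); this particular encoding is a simplification."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    return np.array([(xb - xa) / wa, (yb - ya) / ha, np.log((wb * hb) / (wa * ha))])

def score_candidate(candidate, observed, memex_edges, sigma=1.0):
    """Score how well a candidate exemplar fits a hidden location given the
    observed objects in the scene, without ever naming a category.

    candidate:   (exemplar_id, box) hypothesis for the hidden object.
    observed:    list of (exemplar_id, box) pairs visible in the image.
    memex_edges: dict mapping (id_i, id_j) -> list of spatial features observed
                 between those two exemplars (or their look-alikes) in training.
    """
    total = 0.0
    cand_id, cand_box = candidate
    for obs_id, obs_box in observed:
        rel = spatial_feature(obs_box, cand_box)
        # Non-parametric compatibility: a kernel density over the training
        # arrangements stored on the edge between the two exemplars.
        for train_rel in memex_edges.get((obs_id, cand_id), []):
            total += np.exp(-np.sum((rel - train_rel) ** 2) / (2 * sigma ** 2))
    return total
```

Ranking all candidate exemplars by this score would give a context-only prediction of what the hidden object looks like, without any category labels entering the computation.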

The official paper is out, and can be found on my project page:

Tomasz Malisiewicz, Alexei A. Efros. Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships. In NIPS, December 2009. PDF

Abstract: The use of context is critical for scene understanding in computer vision, where the recognition of an object is driven by both local appearance and the object's relationship to other elements of the scene (context). Most current approaches rely on modeling the relationships between object categories as a source of context. In this paper we seek to move beyond categories to provide a richer appearance-based model of context. We present an exemplar-based model of objects and their relationships, the Visual Memex, that encodes both local appearance and 2D spatial context between object instances. We evaluate our model on Torralba's proposed Context Challenge against a baseline category-based system. Our experiments suggest that moving beyond categories for context modeling appears to be quite beneficial, and may be the critical missing ingredient in scene understanding systems.

I gave a talk about my work yesterday at CMU's Misc-read and received some good feedback. I'll be at NIPS this December representing this body of research.

Monday, October 19, 2009

Scene Prototype Models for Indoor Image Recognition

In today's post I want to briefly discuss a computer vision paper which has caught my attention.

In the paper Recognizing Indoor Scenes, Quattoni and Torralba build a scene recognition system for categorizing indoor images. Instead of performing learning directly in descriptor space (such as the GIST over the entire image), the authors use a "distance-space" representation. An image is described by a vector of distances to a large number of scene prototypes. A scene prototype consists of a root feature (the global GIST) as well as features belonging to a small number of regions associated with the prototype. One example of such a prototype might be an office scene with a monitor region in the center of the image and a keyboard region below it -- however the ROIs (which can be thought of as parts of the scene) are often more abstract and do not neatly correspond to a single object.
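As a rough sketch of the distance-space idea (my own simplification, not the authors' code), an image can be mapped to a vector with one entry per prototype, combining the distance to the prototype's root GIST with distances from the prototype's regions to their best-matching image regions:

```python
import numpy as np

def distance_space_features(image_gist, image_region_feats, prototypes):
    """Describe an image by its distances to a set of scene prototypes.

    image_gist:         global GIST descriptor of the image.
    image_region_feats: list of descriptors for candidate regions in the image.
    prototypes:         list of dicts with a 'root' GIST and a 'regions' feature
                        list (a simplified stand-in for the paper's prototypes).
    """
    feats = []
    for proto in prototypes:
        d_root = np.linalg.norm(image_gist - proto['root'])
        # For each prototype region, distance to its best-matching image region.
        d_regions = [min(np.linalg.norm(r - pr) for r in image_region_feats)
                     for pr in proto['regions']]
        feats.append(d_root + sum(d_regions))
    return np.array(feats)  # one (combined) distance per prototype
```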


The learning problem (which is solved once per category) is then to find the internal parameters of each prototype as well as the per-class prototype distance weights which are used for classification. From a distance function learning point of view, it is rather interesting to see distances to many exemplars being used as opposed to the distance to a single focal exemplar.
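To illustrate what per-class prototype distance weights might look like in practice, here is a hedged sketch that trains a one-vs-rest linear SVM on distance-space features (using placeholder data); the paper's actual learning jointly optimizes the prototype parameters and the weights, which this simple stand-in does not capture.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical training data: each row of X is a distance-to-prototype vector
# (e.g., from distance_space_features above); y holds indoor scene labels.
X = np.random.rand(200, 50)        # 200 images, 50 prototypes (placeholder)
y = np.random.randint(0, 10, 200)  # 10 indoor categories (placeholder)

# One-vs-rest linear SVM: each class learns its own weight for every prototype
# distance, loosely mirroring the per-class prototype distance weights above.
clf = LinearSVC(C=1.0).fit(X, y)
per_class_prototype_weights = clf.coef_  # shape: (n_classes, n_prototypes)
```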

Although the authors report results on the image categorization task, it is worthwhile to ask if scene prototypes could be used for object localization. While it is easy to be the evil genius and devise an image that is unique enough that it doesn't conform to any notion of a prototype, I wouldn't be surprised if 80% of the images we encounter on the internet conform to a few hundred scene prototypes. Of course, the problem of learning such prototypes from data without prototype-labeling (which requires expert vision knowledge) is still open. Overall, I like the direction and ideas contained in this research paper and I'm looking forward to seeing how these ideas develop.