
Friday, December 27, 2013

Understanding the Visual World, one 3D reconstruction at a time

The first generation of datasets in the computer vision community were just plain old images -- simple arrays of pixels.  That seems like nothing fancy, but we should recall that there was a time when a single image could barely fit inside a computer's memory.  During this early period, researchers showcased their image processing algorithms on the infamous Lenna image.  Later we saw datasets like the Corel dataset, Caltech 101, LabelMe, SUN, James Hays' 6 million Flickr images, PASCAL VOC, and ImageNet.  These were impressive collections of images, and for the first time computer vision researchers did not have to collect their own.  As Jitendra Malik once said, large annotated datasets marked the end of the "Wild Wild West" in Computer Vision -- for the first time, researchers could compare object recognition algorithms on the same sets of images!  These datasets differ in their annotations: some come with image-level labels, some with annotated polygons, and some with nothing more than bounding boxes around objects.  The images are captured by a camera and the annotations are produced by a human annotation effort.  But these traditional vision datasets lack depth, 3D information, or anything of that sort.  LabelMe3D was an attempt at reconstructing depth from object annotations, but it only worked in a pop-up-world kind of way.

The next generation of datasets is all about going into 3D.  But not just annotated depth images like the NYU2 Depth Dataset, depicted in the following image:



What a 3D Environment dataset (or 3D place dataset) is all about is making 3D reconstructions the basic primitive of research.  This means that an actual 3D reconstruction algorithm first has to be run to create the dataset.  This is a fairly new idea in the Computer Vision community.  The paper which introduces such a dataset, SUN3D, was presented at this year's ICCV 2013 conference.  I briefly outlined the paper in my ICCV 2013 summary blog post, but I felt that this topic is worthy of its own blog post.  For those interested, the paper link is below:

J. Xiao, A. Owens and A. Torralba.
SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels.
Proceedings of the 14th IEEE International Conference on Computer Vision (ICCV 2013). paper link

Running a 3D reconstruction algorithm is no easy feat, so Xiao et al. found that some basic polygon-level annotations were sufficient for snapping Structure from Motion algorithms into place.  For those of you who don't know what a Structure from Motion (SfM) algorithm is, it is a process which reconstructs the 3D locations of points inside images (the structure) as well as the camera parameters (the motion) for a sequence of images.  Xiao et al.'s SfM algorithm uses the depth data from a Kinect sensor in addition to the manually provided object annotations. Check out their video below:
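To make the "structure" and "motion" terminology concrete, here is a minimal two-view SfM sketch in Python with OpenCV.  This is not the SUN3D pipeline (which additionally fuses Kinect depth and the object annotations); it just shows the classic recipe of matching features, recovering the relative camera motion from the essential matrix, and triangulating points, and it assumes the camera intrinsic matrix K is known.

```python
# Minimal two-view Structure-from-Motion sketch (not the SUN3D pipeline).
# Assumes two overlapping RGB frames and a known 3x3 intrinsic matrix K.
import cv2
import numpy as np

def two_view_sfm(img1, img2, K):
    # 1. Detect and match local features between the two frames.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # 2. Estimate the essential matrix and recover relative camera motion.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    # 3. Triangulate the inlier correspondences to get 3D structure.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera at the origin
    P2 = K @ np.hstack([R, t])                         # second camera pose
    inl1 = pts1[mask.ravel() > 0].T
    inl2 = pts2[mask.ravel() > 0].T
    pts4d = cv2.triangulatePoints(P1, P2, inl1, inl2)
    points3d = (pts4d[:3] / pts4d[3]).T                # homogeneous -> Euclidean

    return R, t, points3d  # the "motion" (R, t) and the "structure" (3D points)
```

The key point is that the depth of a scene point is never measured directly here; it emerges from the geometry of two views, which is exactly where a Kinect (or manual object annotations, as in SUN3D) can help when image matching alone is too fragile.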

Depth dataset vs SUN3D dataset
The NYU2 Depth dataset is useful for studying object detection algorithms which operate on 2.5D images, while MIT's SUN3D dataset is useful for contextual reasoning and object-object relationships.  This is important because Kinect images do not give full 3D; they merely return a 2.5D "depth image" from which certain physical relationships cannot be easily inferred.
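To see what "2.5D" means in practice, here is a small sketch that back-projects a Kinect-style depth map into a point cloud.  Every pixel yields at most one 3D point along its viewing ray, so any surface occluded from the sensor's single viewpoint is simply missing.  The intrinsics below are rough, assumed values for a 640x480 depth map, not calibrated parameters.

```python
# Back-project a single depth frame into a point cloud: a "2.5D" view of the world.
import numpy as np

FX, FY = 525.0, 525.0   # assumed focal lengths in pixels
CX, CY = 319.5, 239.5   # assumed principal point for a 640x480 depth map

def depth_to_point_cloud(depth):
    """depth: HxW array of metric depth values (0 where the sensor returned nothing)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - CX) * z / FX
    y = (v - CY) * z / FY
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels
```

A full 3D place, in contrast, is obtained by registering many such frames into a common coordinate system, which is what a dataset like SUN3D provides out of the box.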

Startups doing 3D
It is also worth pointing out that Matterport, a new startup, is building its own sensors and algorithms to help people create their own 3D reconstructions. Check out their video below:


What this means for the rest of us
We should expect the next generation of smartphones to have their own 3D sensors.  In addition, we should expect the new generation of wearable devices, such as Google Glass, to give us more than 2D reasoning; they should be able to use this 3D data to make better visual inferences.  I'm glad to see 3D getting more and more popular, as it allows researchers to work on new problems, new data structures, and push their creativity to the next level!

Tuesday, April 17, 2012

Using Panoramas for Better Scene Understanding

There's a lot more to automated object interpretation than merely predicting the correct category label.  If we want machines to be able to one day interact with objects in the physical world, then predicting additional properties of objects such as their attributes, segmentations, and poses is of utmost importance.  This has been one of the key motivations in my own research behind exemplar-based models of object recognition.

The same argument holds for scenes.  If we want to build machines which understand the environments around them, then they will have to do much more than predict some sloppy "scene category."  Consider what happens when a machine automatically analyzes a picture and says that it is from the "theater" category.  Well, the picture could be of the stage, the emergency exit, or just about anything else within a theater -- in each of these cases, the "theater" category would be deemed correct, but it would fall short of explaining the content of the image.  Most scene understanding papers either focus on getting the scene category right, or strive to obtain a pixel-wise semantic segmentation map.  However, there's more to scene categories than meets the eye.

There is an interesting paper, to be presented this summer at the CVPR 2012 conference in Rhode Island, which tries to bring the concept of "pose" into scene understanding.  Pose estimation is already well established in the object recognition literature, but this is one of the first serious attempts to bring this way of thinking into scene understanding.

J. Xiao, K. A. Ehinger, A. Oliva and A. Torralba.
Recognizing Scene Viewpoint using Panoramic Place Representation.
Proceedings of the 25th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012).

The SUN360 panorama project page also has links to code, etc.


The basic unit of representation for places in their paper is the panorama.  If you've ever taken a vision course, then you probably stitched a few of your own (a quick stitching sketch follows the examples below).  Below are some examples of cool-looking panoramas from their online gallery.  A panorama roughly covers the space of all images you could take while centered within a place.

Car interior panoramas from SUN360 page
 Building interior panoramas from SUN360 page
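For reference, stitching your own panorama from a handful of overlapping photos is nearly a one-liner these days.  The sketch below uses OpenCV's high-level Stitcher; the filenames are just placeholders, and of course the SUN360 panoramas are full 360-degree captures rather than the output of a snippet like this.

```python
# Minimal panorama stitching with OpenCV's high-level Stitcher API.
import cv2

# Placeholder filenames for a few overlapping photos taken from one viewpoint.
images = [cv2.imread(p) for p in ["left.jpg", "middle.jpg", "right.jpg"]]

stitcher = cv2.Stitcher_create(cv2.Stitcher_PANORAMA)
status, pano = stitcher.stitch(images)

if status == cv2.Stitcher_OK:
    cv2.imwrite("panorama.jpg", pano)
else:
    print("Stitching failed with status", status)
```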

What the proposed algorithm accomplishes is twofold.  First, it acts like an ordinary scene categorization system, but in addition to producing a meaningful semantic label, it also predicts the likely view within a place.  This is very much like predicting that there is a car in an image and then providing an estimate of the car's orientation.  Below are some pictures of inputs (left column), a compass-like visualization which shows the orientation of the picture (with respect to a cylindrical panorama), as well as a depiction of the likely image content falling outside of the image boundary.  The middle column shows per-place mean panoramas (in the style of TorralbaArt), as well as the input image aligned with the mean panorama.
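To give a rough feel for what "predicting the view" means computationally, here is a toy sketch that slides a query photo across a cylindrical panorama and reports the best-matching compass direction.  This is only an illustration of the idea; the actual paper learns the viewpoint jointly with the scene category from many panoramas, rather than template-matching against a single one.

```python
# Toy viewpoint estimation: where along a cylindrical panorama does a photo point?
import cv2
import numpy as np

def estimate_view_direction(panorama, query):
    # Resize the query so its height matches the panorama's height.
    h = panorama.shape[0]
    scale = h / query.shape[0]
    new_w = int(round(query.shape[1] * scale))
    query_r = cv2.resize(query, (new_w, h))

    # Wrap the panorama horizontally so views straddling the seam still match.
    wrapped = np.hstack([panorama, panorama[:, :new_w]])

    # Normalized cross-correlation of the query against every horizontal offset.
    scores = cv2.matchTemplate(wrapped, query_r, cv2.TM_CCOEFF_NORMED)
    _, best_score, _, best_loc = cv2.minMaxLoc(scores)

    # Convert the best column into a compass-style angle in [0, 360).
    angle = 360.0 * (best_loc[0] % panorama.shape[1]) / panorama.shape[1]
    return angle, best_score
```

Even this crude matcher conveys the flavor of the compass visualization above: the output is not just "this is a theater," but "this is a theater, and you are facing the stage."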


I think panoramas are a very natural representation for places -- perhaps not as rich as a full 3D reconstruction, but definitely much richer than static photos.  If we want to build better image understanding systems, then we should seriously start looking at richer sources of information than static images.  There is only so much you can do with static images and MTurk, so videos, 3D models, panoramas, etc. are likely to be big players in the upcoming years.