
Friday, December 27, 2013

Understanding the Visual World, one 3D reconstruction at a time

The first generation of datasets in the computer vision community was just plain old images -- simple arrays of pixels.  That might seem like nothing fancy, but we must recall that there was a time when a single image could barely fit inside a computer's memory.  During this early period, researchers showcased their image processing algorithms on the infamous Lenna image.  But later we saw datasets like the Corel dataset, Caltech 101, LabelMe, SUN, James Hays' 6 million Flickr images, PASCAL VOC, and ImageNet.  These were impressive collections of images, and for the first time computer vision researchers did not have to collect their own images.  As Jitendra Malik once said, large annotated datasets marked the end of the "Wild Wild West" in Computer Vision -- for the first time, large datasets allowed researchers to compare object recognition algorithms on the same sets of images!  These datasets differ in their annotations: some come with labels at the image level, some come with annotated polygons, and some come with nothing more than objects annotated at the bounding-box level.  In each case, images are captured by a camera and annotations are produced by a human annotation effort.  But these traditional vision datasets lack depth, 3D information, or anything of that sort.  LabelMe3D was an attempt at reconstructing depth from object annotations, but it only worked in a pop-up-world kind of way.

The next generation of datasets is all about going into 3D.  But not just annotated depth images like those in the NYU2 Depth Dataset, depicted in the following image:



What a 3D environment dataset (or 3D place dataset) is all about is making 3D reconstructions the basic primitive of research.  This means that an actual 3D reconstruction algorithm has to be run first to create the dataset.  This is a fairly new idea in the Computer Vision community.  The paper introducing such a dataset, SUN3D, was presented at this year's ICCV 2013 conference.  I briefly outlined the paper in my ICCV 2013 summary blog post, but I felt that this topic is worthy of its own blog post.  For those interested, the paper link is below:

J. Xiao, A. Owens, and A. Torralba. SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels. In Proceedings of the 14th IEEE International Conference on Computer Vision (ICCV 2013). [paper link]

Running a 3D reconstruction algorithm is no easy feat, so Xiao et al. found that some basic polygon-level annotations were sufficient for snapping Structure from Motion algorithms into place.  For those of you who don't know what a Structure from Motion (SfM) algorithm is, it is a process which reconstructs the 3D locations of points inside images (the structure) as well as the camera parameters (the motion) for a sequence of images.  Xiao et al.'s SfM algorithm uses the depth data from a Kinect sensor in addition to the manually provided object annotations.  Check out their video below:
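If you're curious what an SfM pipeline looks like in code, here is a minimal two-view sketch in Python using OpenCV: match features between two frames, estimate the relative camera motion, and triangulate a sparse set of 3D points.  This is just an illustration of the classical recipe, not the SUN3D pipeline (which additionally exploits Kinect depth and the object annotations), and the filenames and camera intrinsics below are made up.

```python
# Minimal two-view structure-from-motion sketch (illustrative only).
import cv2
import numpy as np

K = np.array([[525.0,   0.0, 319.5],    # hypothetical Kinect-style intrinsics
              [  0.0, 525.0, 239.5],
              [  0.0,   0.0,   1.0]])

img1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_010.png", cv2.IMREAD_GRAYSCALE)

# 1. Detect and match local features (correspondences between the two views).
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 2. Estimate the relative camera pose (the "motion").
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)

# 3. Triangulate sparse 3D points (the "structure").
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)   # 4xN homogeneous points
X = (X_h[:3] / X_h[3]).T                              # Nx3 points, up to scale
print("Recovered %d 3D points" % len(X))
```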

Depth dataset vs. SUN3D dataset
The NYU2 Depth dataset is useful for studying object detection algorithms which operate on 2.5D images, while MIT's SUN3D dataset is useful for contextual reasoning and object-object relationships.  This distinction is important because Kinect images do not give full 3D; they merely return a 2.5D "depth image" from which certain physical relationships cannot be easily inferred.
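To see why a single Kinect frame is only 2.5D, consider what backprojecting a depth image gives you: a point cloud covering only the surfaces visible from one viewpoint, with nothing behind them.  Here is a tiny sketch; the intrinsics and input file are hypothetical.

```python
# Backproject a Kinect-style depth image into a point cloud (illustrative).
import numpy as np

fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5   # assumed camera intrinsics

def depth_to_point_cloud(depth_m):
    """Turn an HxW depth image (in meters) into an Nx3 point cloud."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.dstack([x, y, z]).reshape(-1, 3)
    return points[points[:, 2] > 0]            # drop pixels with no depth reading

depth = np.load("depth_frame.npy")             # hypothetical HxW float array (meters)
cloud = depth_to_point_cloud(depth)
# `cloud` contains only surfaces facing the camera -- a 2.5D view, not the full scene.
print(cloud.shape)
```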

Startups doing 3D
It is also worth pointing out that Matterport, a new startup, is creating its own sensors and algorithms to help people create their own 3D reconstructions. Check out their video below:


What this means for the rest of us
We should expect the next generation of smartphones to have their own 3D sensors.  In addition, we should expect the new generation of wearable devices such as Google Glass to give us more than 2D reasoning; they should be able to use this 3D data to make better visual inferences.  I'm glad to see 3D getting more and more popular, as this allows researchers to work on new problems, new data structures, and push their creativity to the next level!

Wednesday, July 03, 2013

[CVPR 2013] Three Trending Computer Vision Research Areas

As I walked through the large poster-filled hall at CVPR 2013, I asked myself, "Quo vadis, Computer Vision?" (Where are you going, computer vision?)  I saw lots of papers which exploit last year's ideas, copious amounts of incremental research, and an overabundance of off-the-shelf computational techniques being recombined in seemingly novel ways.  When you have been active in computer vision research for several years, it is not rare to find yourself becoming bored by a significant fraction of papers at research conferences.  Right after the main CVPR conference, I felt mentally drained and needed to get a breath of fresh air, so I spent several days checking out the sights in Oregon.  Here is one picture -- proof that CVPR 2013 had more to offer than ideas!



When I returned from sightseeing, I took a more circumspect look at the field of computer vision.  I immediately noticed that vision research is actually advancing and growing in a healthy way.  (Unfortunately, most junior students have a hard time determining which research papers are actually novel and/or significant.)  A handful of new research themes arise each year, and today I'd like to briefly discuss three new computer vision research themes which are likely to rise in popularity in the foreseeable future (2-5 years).

1) RGB-D input data is trending.  

Many of this year's papers take a single 2.5D RGB-D image as input and try to parse the image into its constituent objects.  The number of papers doing this with RGB-D data is seemingly infinite.  Some other CVPR 2013 approaches don't try to parse the image, but instead do something else: fit cuboids, reason about affordances in 3D, or reason about illumination.  The reason why such inputs are becoming more popular is simple: RGB-D images can be obtained via cheap and readily available sensors such as Microsoft's Kinect.  Depth measurements used to be obtained by expensive time-of-flight sensors (in the late 90s and early 00s), but as of 2013, $150 can buy you one of these depth-sensing bad boys!  In fact, I bought a Kinect just because I thought it might come in handy one day -- and since I've joined MIT, I've been delving into the RGB-D reconstruction domain on my own.  It is just a matter of time until the newest iPhone has an on-board depth sensor, so the current line of research which relies on RGB-D input is likely to become the norm within a few years.
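As a flavor of the kind of geometric reasoning a single RGB-D frame enables, here is a toy RANSAC plane fit for finding a dominant supporting surface, the sort of step that typically precedes cuboid fitting or affordance reasoning.  This is a generic sketch run on synthetic data, not any particular paper's method.

```python
# Toy RANSAC plane fit on a synthetic point cloud (illustrative only).
import numpy as np

def ransac_plane(cloud, n_iters=500, inlier_thresh=0.02, seed=0):
    """Return (unit normal n, offset d) of the plane n.x + d = 0 with the most inliers."""
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = 0, None
    for _ in range(n_iters):
        p0, p1, p2 = cloud[rng.choice(len(cloud), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:                       # degenerate (nearly collinear) sample
            continue
        n /= norm
        d = -n.dot(p0)
        inliers = np.count_nonzero(np.abs(cloud @ n + d) < inlier_thresh)
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (n, d)
    return best_plane

# Synthetic scene: a noisy horizontal plane 1.5m away plus random clutter.
rng = np.random.default_rng(1)
plane_pts = np.column_stack([rng.uniform(-1, 1, 2000),
                             rng.uniform(-1, 1, 2000),
                             1.5 + 0.005 * rng.standard_normal(2000)])
clutter = rng.uniform([-1.0, -1.0, 0.0], [1.0, 1.0, 3.0], size=(500, 3))
cloud = np.vstack([plane_pts, clutter])

normal, d = ransac_plane(cloud)
print("dominant plane normal:", np.round(normal, 3))   # close to [0, 0, +/-1]
```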










2) Mid-level patch discovery is a hot research topic.
Saurabh Singh from CMU introduced this idea in his seminal ECCV 2012 paper, and Carl Doersch applied this idea to large-scale Google Street View imagery in the "What makes Paris look like Paris?" SIGGRAPH 2012 paper.  The idea is to automatically extract mid-level patches (which could be objects, object parts, or just chunks of stuff) from images, with the constraint that these patches are the most informative ones.  Regarding the SIGGRAPH paper, see the video below.






Unsupervised Discovery of Mid-Level Discriminative Patches Saurabh Singh, Abhinav Gupta, Alexei A. Efros. In ECCV, 2012.








Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei A. Efros. What Makes Paris Look like Paris? In SIGGRAPH 2012. [pdf]

At CVPR 2013, it was evident that the idea of "learning mid-level parts for scenes" is being pursued by other top-tier computer vision research groups.  Here are some CVPR 2013 papers which capitalize on this idea:

Blocks that Shout: Distinctive Parts for Scene Classification. Mayank Juneja, Andrea Vedaldi, CV Jawahar, Andrew Zisserman. In CVPR, 2013. [pdf]

Representing Videos using Mid-level Discriminative Patches. Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry Davis. CVPR, 2013. [pdf]

Part Discovery from Partial Correspondence. Subhransu Maji, Gregory Shakhnarovich. In CVPR, 2013. [pdf]
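To make the patch-discovery recipe above a bit more concrete, here is a heavily simplified sketch: sample random patches from a target image collection, cluster their HOG descriptors, train a linear SVM per cluster against a generic "rest of the world" set, and rank clusters by how discriminative they are.  This is not the authors' released code; the directory names and parameters are invented, and the real algorithms iterate between clustering and classification in a far more careful, cross-validated way.

```python
# Simplified mid-level discriminative patch mining (illustrative sketch).
import glob
import numpy as np
from skimage import color, io
from skimage.feature import hog
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

PATCH, N_PER_IMG, N_CLUSTERS = 64, 30, 50
rng = np.random.default_rng(0)

def sample_patch_features(image_paths):
    """Sample random square patches and describe each one with HOG."""
    feats = []
    for path in image_paths:
        img = color.rgb2gray(io.imread(path))
        for _ in range(N_PER_IMG):
            y = rng.integers(0, img.shape[0] - PATCH)
            x = rng.integers(0, img.shape[1] - PATCH)
            patch = img[y:y + PATCH, x:x + PATCH]
            feats.append(hog(patch, pixels_per_cell=(8, 8), cells_per_block=(2, 2)))
    return np.array(feats)

pos = sample_patch_features(glob.glob("paris_streetview/*.jpg"))   # target image set
neg = sample_patch_features(glob.glob("other_cities/*.jpg"))       # "rest of the world"

# 1. Cluster the target patches to get candidate mid-level elements.
labels = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0).fit_predict(pos)

# 2. Train a linear SVM per cluster and score how well it separates the
#    cluster's patches from the generic negative set.
scores = np.full(N_CLUSTERS, -np.inf)
for c in range(N_CLUSTERS):
    members = pos[labels == c]
    if len(members) < 5:
        continue
    X = np.vstack([members, neg])
    y = np.r_[np.ones(len(members)), np.zeros(len(neg))]
    clf = LinearSVC(C=0.1, max_iter=5000).fit(X, y)
    scores[c] = clf.decision_function(members).mean() - clf.decision_function(neg).mean()

print("most discriminative candidate clusters:", np.argsort(scores)[::-1][:5])
```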

3) Deep-learning and feature learning are on the rise within the Computer Vision community.
It seems that everybody at Google Research is working on Deep-learning.  Will it solve all vision problems?  Is it the one computational ring to rule them all?  Personally, I doubt it, but the rising presence of deep learning is forcing every researcher to brush up on their l33t backprop skillz.  In other words, if you don't know who Geoff Hinton is, then you are in trouble.
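And since backprop keeps coming up, here is the whole trick in a few lines of NumPy: a two-layer network trained on toy XOR data with hand-derived gradients.  Purely illustrative, with no connection to any particular paper.

```python
# Backprop in a nutshell: a two-layer net on XOR with hand-derived gradients.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.standard_normal((2, 8)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 1)), np.zeros(1)
lr = 0.5

for step in range(5000):
    # forward pass
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))      # sigmoid output
    # backward pass: gradients of the mean squared error
    dp = (p - y) / len(X)
    dz2 = dp * p * (1 - p)
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (1 - h ** 2)             # tanh derivative
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    # gradient descent step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(p.ravel(), 2))   # should approach [0, 1, 1, 0]
```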

Thursday, October 06, 2011

Kinect Object Datasets: Berkeley's B3DO, UW's RGB-D, and NYU's Depth Dataset

Why Kinect?
The Kinect, made by Microsoft, is starting to become quite a common item in Robotics and Computer Vision research.  While the Robotics community has been using the Kinect as a cheap laser sensor which can be used for obstacle avoidance, the vision community has been excited about using the 2.5D data associated with the Kinect for object detection and recognition.  The possibility of building object recognition systems which have access to pixel features as well as 2.5D features is truly exciting for the vision hacker community!

Berkeley's B3DO
First of all, I would like to mention that it looks like the Berkeley Vision Group has jumped on the Kinect bandwagon.  But the data collection effort will be crowdsourced -- they need your help!  They need you to use your Kinect to capture your own home/office environments and upload the data to their servers.  This way, a very large dataset will be collected, and we, the vision hackers, can use machine learning techniques to learn what sofas, desks, chairs, monitors, and paintings look like.  The Berkeley hackers have a paper on this at one of the ICCV 2011 workshops in Barcelona; here is the paper information:



A Category-Level 3-D Object Dataset: Putting the Kinect to Work
Allison Janoch, Sergey Karayev, Yangqing Jia, Jonathan T. Barron, Mario Fritz, Kate Saenko, Trevor Darrell
ICCV-W 2011
[pdf] [bibtex]


UW's RGB-D Object Dataset
On another note, if you want to use 3D for your own object recognition experiments then you might want to check out the following dataset: University of Washington's RGB-D Object Dataset.  With this dataset you'll be able to compare against UW's current state-of-the-art.




In this dataset you will find RGB + Kinect 3D data for many household items captured from different views.  Here is the really cool paper which got me excited about the RGB-D Object Dataset:
A Scalable Tree-based Approach for Joint Object and Pose Recognition
Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox
In the Twenty-Fifth Conference on Artificial Intelligence (AAAI), August 2011.



NYU's Depth Dataset
I have to admit that I did not know about this dataset (created by Nathan Silberman of NYU) until after I blogged about the other two datasets.  Check out the NYU Depth Dataset homepage.  But the internet is great: only a few hours after I posted this short blog post, somebody let me know that I had left out this really cool NYU dataset.  In fact, it looks like this particular dataset might be at the LabelMe level in terms of dense object annotations, but with accompanying Kinect data.  Rob Fergus & Co. strike again!


Nathan Silberman, Rob Fergus. Indoor Scene Segmentation using a Structured Light Sensor. To Appear: ICCV 2011 Workshop on 3D Representation and Recognition


Friday, December 31, 2010

why I should be hacking with a kinect

It was recently brought to my attention that Alex Berg a.k.a. Alexander Berg is hacking with a Kinect.
In case you didn't know, Alex Berg is an assistant professor at Stony Brook University as of Sept 2010.  He came out of Jitendra Malik's group, and can be thought of as my academic uncle (because he got his PhD with Jitendra at basically the same time as my advisor, Alyosha Efros). I am a big fan of Alex Berg's work.  (See the paper at ECCV 2010: What does classifying more than 10,000 image categories tell us? and note his upcoming workshop "Large Scale Learning for Vision" at CVPR 2011).

I had already known that Xiaofeng Ren has been hacking with RGB-D cameras such as the Kinect for some time now.  Xiaofeng Ren has been a research scientist at Intel Labs Seattle since 2008 and on the affiliate faculty of the CSE department at UW since 2010.  He is another one of my many academic uncles and has contributed greatly to the field of Computer Vision.  For some of his recent work with Kinects, see his RGB-D project page.  Xiaofeng Ren's work has also been very influential on my own research -- it is worthwhile to recall that he coined the term "superpixels", which is prevalent in contemporary Computer Vision literature.



So when I learned that these bad-ass ex-Berkeley hackers are hacking with Kinects, I figured it was time to acquire one of my own.  I bought a Kinect today and plan on playing with Alex Berg's kinect2matlab interface for Mac OS X soon!

So, why aren't you hacking with a kinect?