Wednesday, October 26, 2011

Google Internship in Vision/ML

Disclaimer: the following post is cross-posted from Yaroslav's "Machine Learning, etc" blog. Since I always rave about my experiences at Google as an intern (did it twice!), I thought some of my fellow readers would find this information useful.  If you are a vision PhD student at CMU or MIT, feel free to ask me more about life at Google.  If you have questions regarding the following internship offer, you'll have to ask Yaroslav.

Original post at:

My group has intern openings for winter and summer. Winter may be too late (but if you really want winter, ping me and I'll find out feasibility). We use OCR for Google Books, frames from YouTube videos, spam images, unreadable PDFs encountered by the crawler, images from Google's StreetView cameras, Android, and a few other areas. Recognizing individual character candidates is a key step in an OCR system, and one that machines are not very good at. Even with 0 context, humans are better. This shall not stand!

For example, when I showed the picture below to my Taiwanese coworker, he immediately said that these were multiple instances of Chinese "one".

Here are 4 of those images close-up. Classical OCR approaches have trouble with these characters.

This is a common problem for high-noise domains like camera pictures and digital text rasterized at low resolution. Some results suggest that techniques from Machine Vision can help.

For low-noise domains like Google Books and broken PDF indexing, shortcomings of traditional OCR systems are due to:
1) Large number of classes (100k letters in Unicode 6.0)
2) Non-trivial variation within classes
Example of "non-trivial variation"

I found over 100k distinct instances of digital letter 'A' from just one day's crawl worth of documents from the web. Some more examples are here

Chances are that the ideas for a human-level classifier are out there. They just haven't been implemented and tested in realistic conditions. We need someone with an ML/Vision background to come to Google and implement a great character classifier.

You'd have a large impact if your ideas become part of Tesseract. Through books alone, your code will be run on books from 42 libraries. And since Tesseract is open-source, you'd be contributing to the main OCR effort in the open-source community.

You will get a ton of data, resources and smart people around you. It's a very low-bureaucracy place. You could run Matlab code on 10k cores if you really wanted, and I know someone who has launched 200k core jobs for a personal project. The infrastructure also makes things easier. Google's MapReduce can sort a petabyte of data (10 trillion strings) with 8000 machines in just 30 mins. Some of the work in our team used features coming from distributed deep belief infrastructure.
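To get a feel for the sorting figures quoted above, here is a rough back-of-the-envelope check. It assumes a decimal petabyte (10^15 bytes) and an even split of the data across machines; both are assumptions made for illustration, not details from the original benchmark.

```python
# Back-of-the-envelope check of the petabyte sort figures quoted above.
# Assumptions (not from the post): 1 PB = 10**15 bytes, data split evenly.
PETABYTE_BYTES = 10**15
NUM_RECORDS = 10 * 10**12      # "10 trillion strings"
NUM_MACHINES = 8000
SORT_SECONDS = 30 * 60         # "30 mins"

# Average size of one string being sorted.
bytes_per_record = PETABYTE_BYTES // NUM_RECORDS

# Average data volume each machine must push through per second, in MB/s.
mb_per_second_per_machine = PETABYTE_BYTES / NUM_MACHINES / SORT_SECONDS / 10**6

print(f"{bytes_per_record} bytes per record")
print(f"{mb_per_second_per_machine:.1f} MB/s per machine")
```

Under these assumptions each string is about 100 bytes, and each machine handles roughly 70 MB every second for the whole half hour.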

In order to get an internship position, you must pass a general technical screen that I have no control over. If you are interested in more details, you could contact me directly.  -- Yaroslav

(the link to apply is usually here, but now it's down, will update when it's fixed)

Tuesday, October 25, 2011

NIPS 2011 preview: person grammars and machines-in-the-loop for video annotation

Object Detection with Grammar Models
To appear in NIPS 2011 pdf

Today, I want to point out two upcoming NIPS papers which might be of interest to the Computer Vision community.  First, we have a person detection paper from the hackers who brought you Latent Discriminatively Trained Part-based Models (aka voc-release-3.1 and voc-release-4.0).  I personally don't care for grammars (I think exemplars are a much more data-driven and computation-friendly way of modeling visual concepts), but I think any paper with Pedro on the author list is really worth checking out.  Maybe after I digest all the details, I'll jump on the grammar bandwagon (but I doubt it).  Also of note is the fact that Pedro Felzenszwalb has relocated to Brown University.

The second paper is by Carl Vondrick and Deva Ramanan (also of latent-svm fame).  Carl is the author of vatic and a fellow vision@github hacker.  Carl, like myself, has joined Antonio Torralba's group at MIT this fall.  He just started his PhD, so you can only expect the quality of his work to increase without bound over the next ~5 years.  vatic is an online, interactive video annotation tool for computer vision research that crowdsources work to Amazon's Mechanical Turk. vatic makes it easy to build massive, affordable video data sets and can be deployed on a cloud. Written in Python + C + JavaScript, vatic is free and open-source software. The video below showcases the power of vatic.

In this paper, Vondrick et al. use active learning to select the frames which require human annotation.  Rather than simply doing linear interpolation between frames, they are truly putting the "machine-in-the-loop." When doing large-scale video annotation, this approach can supposedly save you tens of thousands of dollars.
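For reference, the plain linear-interpolation baseline mentioned above can be sketched in a few lines. This is only a minimal illustration of that baseline, not the active learning method from the paper and not vatic's actual code; the function name `interpolate_boxes` and the `(xmin, ymin, xmax, ymax)` box format are assumptions made for the example.

```python
from typing import Dict, Tuple

# Hypothetical box format for this sketch: (xmin, ymin, xmax, ymax) in pixels.
Box = Tuple[float, float, float, float]


def interpolate_boxes(keyframes: Dict[int, Box]) -> Dict[int, Box]:
    """Fill in boxes between human-annotated keyframes by linear interpolation."""
    if not keyframes:
        return {}
    frames = sorted(keyframes)
    track: Dict[int, Box] = {}
    # Walk over consecutive pairs of annotated frames.
    for start, end in zip(frames, frames[1:]):
        box_a, box_b = keyframes[start], keyframes[end]
        span = end - start
        for frame in range(start, end):
            t = (frame - start) / span  # 0.0 at start, approaching 1.0 at end
            track[frame] = tuple(
                (1 - t) * a + t * b for a, b in zip(box_a, box_b)
            )
    # The last annotated frame is copied as-is.
    track[frames[-1]] = keyframes[frames[-1]]
    return track


# Two annotated frames, ten frames apart.
track = interpolate_boxes({0: (0, 0, 10, 10), 10: (10, 20, 20, 30)})
print(track[5])
```

Interpolation like this only works well when the object moves smoothly between the annotated frames, and choosing which frames a human should annotate is exactly the part the paper hands over to active learning.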

Carl Vondrick and Deva Ramanan. "Video Annotation and Tracking with Active Learning." Neural Information Processing Systems (NIPS), Granada, Spain, December 2011. [paper] [slides]

Thursday, October 06, 2011

Kinect Object Datasets: Berkeley's B3DO, UW's RGB-D, and NYU's Depth Dataset

Why Kinect?
The Kinect, made by Microsoft, is starting to become quite a common item in Robotics and Computer Vision research.  While the Robotics community has been using the Kinect as a cheap laser sensor which can be used for obstacle avoidance, the vision community has been excited about using the 2.5D data associated with the Kinect for object detection and recognition.  The possibility of building object recognition systems which have access to pixel features as well as 2.5D features is truly exciting for the vision hacker community!

Berkeley's B3DO
First of all, I would like to mention that it looks like the Berkeley Vision Group jumped on the Kinect bandwagon.  But the data collection effort will be crowdsourced -- they need your help!  They need you to use your Kinect to capture your own home/office environments and upload the data to their servers.  This way, a very large dataset will be collected, and we, the vision hackers, can use machine learning techniques to learn what sofas, desks, chairs, monitors, and paintings look like.  The Berkeley hackers have a paper on this at one of the ICCV 2011 workshops in Barcelona. Here is the paper information:

A Category-Level 3-D Object Dataset: Putting the Kinect to Work
Allison Janoch, Sergey Karayev, Yangqing Jia, Jonathan T. Barron, Mario Fritz, Kate Saenko, Trevor Darrell
ICCV-W 2011
[pdf] [bibtex]

UW's RGB-D Object Dataset
On another note, if you want to use 3D for your own object recognition experiments then you might want to check out the following dataset: University of Washington's RGB-D Object Dataset.  With this dataset you'll be able to compare against UW's current state-of-the-art.

In this dataset you will find RGB+Kinect3D data for many household items taken from different views.  Here is the really cool paper which got me excited about the RGB-D Dataset:
A Scalable Tree-based Approach for Joint Object and Pose Recognition
Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox
In the Twenty-Fifth Conference on Artificial Intelligence (AAAI), August 2011.

NYU's Depth Dataset
I have to admit that I did not know about this dataset (created by Nathan Silberman of NYU) until after I blogged about the other two datasets.  However, the internet is great, and only a few hours after I posted this short blog post, somebody let me know that I had left out this really cool NYU dataset.  In fact, it looks like this particular dataset might be at the LabelMe level regarding dense object annotations, but with accompanying Kinect data.  Check out the NYU Depth Dataset homepage.  Rob Fergus & Co strike again!

Nathan Silberman, Rob Fergus. Indoor Scene Segmentation using a Structured Light Sensor. To Appear: ICCV 2011 Workshop on 3D Representation and Recognition