
Tuesday, October 25, 2011

NIPS 2011 preview: person grammars and machines-in-the-loop for video annotation

Object Detection with Grammar Models
To appear in NIPS 2011 [pdf]

Today, I want to point out two upcoming NIPS papers which might be of interest to the Computer Vision community.  First, we have a person detection paper from the hackers who brought you Latent Discriminatively Trained Part-based Models (aka voc-release-3.1 and voc-release-4.0).  I personally don't care for grammars (I think exemplars are a much more data-driven and computation-friendly way of modeling visual concepts), but I think any paper with Pedro on the author list is really worth checking out.  Maybe after I digest all the details, I'll jump on the grammar bandwagon (but I doubt it).  Also of note is the fact that Pedro Felzenszwalb has relocated to Brown University.



The second paper is by Carl Vondrick and Deva Ramanan (also of latent-svm fame).  Carl is the author of vatic and a fellow vision@github hacker.  Carl, like me, has joined Antonio Torralba's group at MIT this fall.  He just started his PhD, so you can only expect the quality of his work to increase without bound over the next ~5 years.  vatic is an online, interactive video annotation tool for computer vision research that crowdsources work to Amazon's Mechanical Turk.  vatic makes it easy to build massive, affordable video data sets and can be deployed on the cloud.  Written in Python + C + Javascript, vatic is free and open-source software.  The video below showcases the power of vatic.



In this paper, Vondrick et al. use active learning to select the frames which require human annotation.  Rather than simply doing linear interpolation between frames, they are truly putting the "machine-in-the-loop." When doing large-scale video annotation, this approach can supposedly save you tens of thousands of dollars.
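To make the flavor of the idea concrete, here is a toy Matlab sketch of active frame selection.  Everything in it is hypothetical (the function name, the per-frame uncertainty scores, the fixed interpolation radius) -- it only illustrates the "label the frame the tracker is least sure about" intuition, not the actual algorithm from the paper.

function keyframes = select_keyframes(uncertainty, budget, radius)
% uncertainty: 1 x T vector of per-frame tracker uncertainty (hypothetical)
% budget:      number of frames a human will be asked to annotate
% radius:      frames near a labeled frame are assumed easy to interpolate
keyframes = zeros(1, budget);
for k = 1:budget
  [~, t] = max(uncertainty);        % grab the most uncertain frame
  keyframes(k) = t;
  lo = max(1, t - radius);
  hi = min(numel(uncertainty), t + radius);
  uncertainty(lo:hi) = -inf;        % its neighborhood is now cheap to interpolate
end
keyframes = sort(keyframes);

The point of such a greedy scheme is that each human-labeled frame buys you accurate interpolation in its neighborhood, so the budget goes to the frames where the machine actually needs help.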

Carl Vondrick and Deva Ramanan. "Video Annotation and Tracking with Active Learning." Neural Information Processing Systems (NIPS), Granada, Spain, December 2011. [paper] [slides]

Tuesday, August 16, 2011

Question: What makes an object recognition system great?

Today, instead of discussing my own perspectives on object recognition or sharing some useful links, I would like to ask a general question geared towards anybody working in the field of computer vision:

What makes an object recognition system great?

In particular, I would like to hear a broad range of perspectives regarding what is necessary to provide an impact-creating open-source object recognition system for the research community to use.  As a graduate student you might be interested in building your own recognition system, as a researcher you might be interested in extending or comparing against a current system, and as an educator you might want to direct your students to a fully-functional object recognition system which could be used to bootstrap their research.



To start the discussion I would like to first enumerate a few elements which I find important in making an object recognition system great.

Open Source
In order for object recognition to progress, I think releasing binary executables is simply not enough.  Allowing others to see your source code means that you gain more scientific credibility and you let others extend your system -- this means letting others both train and test variants of your system.  More people using an object recognition system also translates to a higher citation count, which is favorable for researchers seeking career advancement.  Felzenszwalb et al. have released multiple open-source versions of their Discriminatively Trained Deformable Part Model -- each time we see a new release it gets better!  Such continual development means that we know the authors really care about this problem.  I feel Github, with its distributed version control and social-coding features, is a powerful tool the community should adopt if it wants to take its ideas to the next level.  In my own research (e.g., the Ensemble of Exemplar-SVMs approach), I have started using Github (for both private and public development) and I love it.  Linux might have been started by a single individual, but it took a community to make it great.  Just look at where Linux is now.

Ease of use
For ease of use, it is important that the system is implemented in a popular language which is known by a large fraction of the vision community.  Matlab, Python, C++, and Java are such popular languages, and many good implementations are a combination of Matlab with some highly-optimized routines in C++.  Good documentation is also important, since one cannot expect only experts to be using such a system.

Strong research results
The YaRS ("yet-another-recognition-system") approach doesn't translate to high usage unless the system actually performs well on a well-accepted object recognition task.  Every year at vision conferences, many new recognition frameworks are introduced, but only a few of them ever pass the test of time.  Usually an idea withstands time because it is a conceptual contribution to science, but systems such as the HOG-based pedestrian detector of Dalal-Triggs and the Latent Deformable Part Model of Felzenszwalb et al. are actually being used by many other researchers.  The ideas in these works are not only good, but the recognition systems are great.

Question:
So what would you like to see in the next generation of object recognition systems?  I will try my best to reply to any comments posted below.  Any really great comment might even trigger a significant discussion, perhaps enough to warrant its own blog post.  Anybody is welcome to comment/argue/speculate below, either using their real name or anonymously.



Saturday, August 13, 2011

blazing fast nms.m (from exemplar-svm library)

If you care about building large-scale object recognition systems, you have to care about speed.  And every little bit of performance counts -- so why not first optimize the stuff which is slowing you down?

NMS (non-maximum suppression) is a very popular post-processing method for eliminating redundant object detection windows.  I have taken Felzenszwalb et al.'s nms.m and made it significantly faster by eliminating an inner loop.  6 years of grad school, 6 years of building large-scale vision systems in Matlab, and you really learn how to vectorize code.  The code I call millions of times needs to be fast, and nms is one of those routines I call all the time.

The code is found below as a Github gist -- which was taken from my Exemplar-SVM object recognition library (from my ICCV 2011 paper: Ensemble of Exemplar-SVMs for Object Detection and Beyond).  The same file, nms.m, can also be found as part of the Exemplar-SVM library on Github.  This code produces the same result as Pedro's code, but is much faster.  Here is one timing experiment, running nms on ~300K windows, where my version is roughly 100 times faster.  When you deal with exemplar-SVMs you have to deal with lots of detectors (i.e., lots of detection windows), so fast NMS is money.
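To show the trick itself, here is a minimal sketch of the vectorized suppression loop, assuming the windows arrive as an N x 5 matrix of [x1 y1 x2 y2 score] rows -- for the authoritative version, grab nms.m from the Exemplar-SVM repository.

function top = nms_sketch(boxes, overlap)
% Minimal sketch of vectorized non-maximum suppression.
% boxes:   N x 5 matrix, each row is [x1 y1 x2 y2 score]
% overlap: suppression threshold (e.g., 0.5)
if isempty(boxes), top = []; return; end
x1 = boxes(:,1); y1 = boxes(:,2);
x2 = boxes(:,3); y2 = boxes(:,4);
s  = boxes(:,5);
area = (x2-x1+1) .* (y2-y1+1);
[~, I] = sort(s);                       % ascending by score
pick = zeros(size(s));
counter = 1;
while ~isempty(I)
  last = numel(I);
  i = I(last);                          % current highest-scoring window
  pick(counter) = i;
  counter = counter + 1;
  % overlap of window i with ALL remaining windows in one vectorized shot
  xx1 = max(x1(i), x1(I(1:last-1)));
  yy1 = max(y1(i), y1(I(1:last-1)));
  xx2 = min(x2(i), x2(I(1:last-1)));
  yy2 = min(y2(i), y2(I(1:last-1)));
  w = max(0, xx2-xx1+1);
  h = max(0, yy2-yy1+1);
  o = (w .* h) ./ area(I(1:last-1));    % intersection over candidate area
  I = I(find(o <= overlap));            % keep low-overlap windows; drops i too
end
top = boxes(pick(1:counter-1), :);

The speedup comes from the max/min calls operating on whole vectors at once, where the original looped over the remaining windows one at a time.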


>> tic;top = nms_original(bbs,.5);toc
Elapsed time is 58.172313 seconds.


>> tic;top = nms_fast(bbs,.5);toc
Elapsed time is 0.532638 seconds.