
Tuesday, January 20, 2015

From feature descriptors to deep learning: 20 years of computer vision

We all know that deep convolutional neural networks have produced some stellar results on object detection and recognition benchmarks in the past few years (2012-2014), so you might wonder: what did the earlier object recognition techniques look like? How do the designs of earlier recognition systems relate to the modern multi-layer convolution-based framework?

Let's take a look at some of the big ideas in Computer Vision from the last 20 years.

The rise of the local feature descriptors: ~1995 to ~2000
When SIFT (an acronym for Scale Invariant Feature Transform) was introduced by David Lowe in 1999, the world of computer vision research changed almost overnight. It was a robust solution to the problem of comparing image patches. Before SIFT entered the game, people were just using SSD (sum of squared distances) to compare patches and not giving it much thought.
The SIFT recipe: gradient orientations, normalization tricks

SIFT is something called a local feature descriptor -- it is one of those research findings that is the result of one ambitious man hackplaying with pixels for more than a decade.  Lowe and the University of British Columbia got a patent on SIFT, and Lowe released a nice compiled binary of his very own SIFT implementation for researchers to use in their work.  SIFT allows a point inside an RGB image to be represented robustly by a low-dimensional vector.  When you take multiple images of the same physical object while rotating the camera, the SIFT descriptors of corresponding points are very similar in their 128-D space.  At first glance it seems silly that you need to do something as complex as SIFT, but believe me: just because you, a human, can look at two image patches and quickly "understand" that they belong to the same physical point, it does not mean the task is easy for machines.  SIFT had massive implications for the geometric side of computer vision (stereo, Structure from Motion, etc.) and later became the basis for the popular Bag of Words model for object recognition.
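To make the matching idea concrete, here is a minimal sketch using OpenCV's built-in SIFT (assuming opencv-python 4.4+, where SIFT is included after the patent expired); the filenames are placeholders and the 0.75 threshold is just Lowe's commonly used ratio-test value.

import cv2

# Two views of the same physical object (placeholder filenames).
img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, desc1 = sift.detectAndCompute(img1, None)   # desc1: N1 x 128 float32
kp2, desc2 = sift.detectAndCompute(img2, None)   # desc2: N2 x 128 float32

# Brute-force matching in 128-D descriptor space with Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(desc1, desc2, k=2)
        if m.distance < 0.75 * n.distance]
print(len(good), "putative correspondences")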

Seeing a technique like SIFT dramatically outperform an alternative method like Sum-of-Squared-Distances (SSD) image patch matching firsthand is an important step in every aspiring vision scientist's career. And SIFT isn't just a vector of filter bank responses: the binning and normalization steps are very important. It is also worth noting that while SIFT was initially (in its published form) applied to the output of an interest point detector, it was later found that the interest point detection step was not important in categorization problems.  For categorization, researchers eventually moved towards vector-quantized SIFT applied densely across an image.

I should also mention that other descriptors such as Spin Images (see my 2009 blog post on spin images) came out a little bit earlier than SIFT, but because Spin Images were solely applicable to 2.5D data, this feature's impact wasn't as great as that of SIFT. 

The modern dataset (aka the hardening of vision as science): ~2000 to ~2005
Homography estimation, ground-plane estimation, robotic vision, SfM, and all other geometric problems in vision greatly benefited from robust image features such as SIFT.  But towards the end of the 1990s, it was clear that the internet was the next big thing.  Images were going online. Datasets were being created.  And no longer was the current generation solely interested in structure recovery (aka geometric) problems.  This was the beginning of the large-scale dataset era, with Caltech-101 slowly gaining popularity and categorization research on the rise. No longer were researchers evaluating their own algorithms on in-house datasets -- we now had a more objective and standard way to determine if yours is bigger than mine.  Even though Caltech-101 is considered outdated by 2015 standards, it is fair to think of this dataset as the grandfather of the more modern ImageNet dataset. Thanks, Fei-Fei Li.

Category-based datasets: the infamous Caltech-101 TorralbaArt image

Bins, Grids, and Visual Words (aka Machine Learning meets descriptors): ~2000 to ~2005
After the community shifted towards more ambitious object recognition problems and away from geometry recovery problems, we had a flurry of research in Bag of Words, Spatial Pyramids, Vector Quantization, as well as machine learning tools used in any and all stages of the computer vision pipeline.  Raw SIFT was great for wide-baseline stereo, but it wasn't powerful enough to provide matches between two distinct object instances from the same visual object category.  What was needed was a way to encode the following ideas: object parts can deform relative to each other and some image patches can be missing.  Overall, a much more statistical way to characterize objects was needed.

Visual Words were introduced by Josef Sivic and Andrew Zisserman in approximately 2003 as a clever way of taking algorithms from large-scale text matching and applying them to visual content.  A visual dictionary can be obtained by performing unsupervised learning (basically just K-means) on SIFT descriptors, which maps these 128-D real-valued vectors into integers (cluster center assignments).  A histogram of these visual words is a fairly robust way to represent images.  Variants of the Bag of Words model are still heavily utilized in vision research.
Josef Sivic's "Video Google": Matching Graffiti inside the Run Lola Run video
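As a rough sketch of the idea (with an arbitrary vocabulary size, not the settings from the Video Google paper), building a visual dictionary and a bag-of-words histogram might look like this:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(all_descriptors, k=1000):
    # all_descriptors: (N, 128) SIFT descriptors pooled over a training corpus.
    return MiniBatchKMeans(n_clusters=k).fit(all_descriptors)

def bag_of_words(descriptors, vocab):
    words = vocab.predict(descriptors)                   # integer word ids
    hist = np.bincount(words, minlength=vocab.n_clusters)
    return hist / max(hist.sum(), 1)                     # L1-normalized histogram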

Another idea which was gaining traction at the time was the idea of using some sort of binning structure for matching objects.  Caltech-101 images mostly contained objects, so these grids were initially placed around entire images, and later on they would be placed around object bounding boxes.  Here is a picture from Kristen Grauman's famous Pyramid Match Kernel paper which introduced a powerful and hierarchical way of integrating spatial information into the image matching process.

Grauman's Pyramid Match Kernel for Improved Image Matching
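In the same spirit (a simplified sketch, not Grauman's exact kernel or its level weighting), a spatial pyramid representation just histograms the visual words over successively finer grids and concatenates the per-cell histograms:

import numpy as np

def spatial_pyramid(points_xy, word_ids, img_w, img_h, vocab_size, levels=3):
    # points_xy: (N, 2) keypoint locations; word_ids: (N,) visual word assignments.
    feats = []
    for level in range(levels):                      # 1x1, 2x2, 4x4 grids
        cells = 2 ** level
        cx = np.minimum((points_xy[:, 0] * cells / img_w).astype(int), cells - 1)
        cy = np.minimum((points_xy[:, 1] * cells / img_h).astype(int), cells - 1)
        for cell in range(cells * cells):
            in_cell = (cy * cells + cx) == cell
            feats.append(np.bincount(word_ids[in_cell], minlength=vocab_size))
    return np.concatenate(feats).astype(float)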


At some point it was not clear whether researchers should focus on better features, better comparison metrics, or better learning.  In the mid 2000s it wasn't clear if young PhD students should spend more time concocting new descriptors or kernelizing their support vector machines to death.

Object Templates (aka the reign of HOG and DPM): ~2005 to ~2010
At around 2005, a young researcher named Navneet Dalal showed the world just what could be done with his own new badass feature descriptor, HOG.  (It is sometimes written as HoG, but because it is an acronym for “Histogram of Oriented Gradients” it should really be HOG. The confusion must have come from an earlier approach called DoG, which stood for Difference of Gaussians, in which case the “o” should definitely be lower case.)

Navneet Dalal's HOG Descriptor


HOG came at a time when everybody was applying spatial binning to bags of words, using multiple layers of learning, and making their systems overly complicated. Dalal's ingenious descriptor was actually quite simple.  The seminal HOG paper was published in 2005 by Navneet and his PhD advisor, Bill Triggs. Triggs got his fame from earlier work on geometric vision, and Dr. Dalal got his fame from his newly found descriptor.  HOG was initially applied to the problem of pedestrian detection, and one of the reasons it became so popular was that the machine learning tool used on top of HOG was quite simple and well understood: the linear Support Vector Machine.
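For the curious, here is a minimal sketch of pulling out a HOG descriptor with scikit-image; the cell and block settings below are the library's common defaults, not necessarily the exact parameters of the Dalal-Triggs pedestrian detector, and the filename is a placeholder.

from skimage import color, io
from skimage.feature import hog

img = color.rgb2gray(io.imread("pedestrian.jpg"))    # placeholder filename
descriptor = hog(img,
                 orientations=9,
                 pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2),
                 block_norm="L2-Hys")
# One long vector of block-normalized gradient histograms; a linear SVM is
# typically trained directly on vectors like this.
print(descriptor.shape)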

I should point out that in 2008, a follow-up paper on object detection, which introduced a technique called the Deformable Parts-based Model (or DPM as we vision guys call it), helped reinforce the popularity and strength of the HOG technique. I personally jumped on the HOG bandwagon in about 2008.  During my first few years as a grad student (2005-2008), I was hackplaying with my own vector-quantized filter bank responses, and definitely developed some strong intuition regarding features.  In the end I realized that my own features were only "okay," and because I was applying them to the outputs of image segmentation algorithms they were extremely slow.  Once I started using HOG, it didn't take me long to realize there was no going back to custom, slow features.  Once I started using a multiscale feature pyramid with a slightly improved version of HOG introduced by master hackers such as Ramanan and Felzenszwalb, I was processing images at 100x the speed of multiple segmentations + custom features (my earlier work).
The infamous Deformable Part-based Model (for a Person)

DPM was the reigning champ on the PASCAL VOC challenge, and one of the reasons why it became so popular was the excellent MATLAB/C++ implementation by Ramanan and Felzenszwalb.  I still know many researchers who never fully acknowledged what releasing such great code really meant for the fresh generation of incoming PhD students, but at some point it seemed like everybody was modifying the DPM codebase for their own CVPR attempts.  Too many incoming students were lacking solid software engineering skills, and giving them the DPM code was a surefire way to get some experiments up and running.  Personally, I never jumped on the parts-based methodology, but I did take apart the DPM codebase several times.  However, when I put it back together, the Exemplar-SVM was the result.

Big data, Convolutional Neural Networks and the promise of Deep Learning: ~2010 to ~2015
Sometime around 2008, it was pretty clear that scientists were getting more and more comfortable with large datasets.  It wasn't just the rise of "Cloud Computing" and "Big Data," it was the rise of the data scientists: hacking on equations by morning, developing a prototype during lunch, deploying large-scale computations in the evening, and integrating the findings into a production system by sunset.  During my two summers at Google Research, I saw lots of guys who had made their fame as vision hackers.  But they weren't just writing "academic" papers at Google -- they were sharding datasets with one hand, compiling results for their managers, writing Borg scripts in their sleep, and piping results into gnuplot (because Jedis don't need GUIs?). It was pretty clear that big data and a DevOps mentality were here to stay, and the vision researcher of tomorrow would be quite comfortable with large datasets.  No longer did you need one guy with a mathy PhD, one software engineer, one manager, and one tester; there were plenty of people who could do all of those jobs.

Deep Learning: 1980s - 2015
2014 was definitely a big year for Deep Learning.  What's interesting about Deep Learning is that it is a very old technique.  What we're seeing now is essentially the Neural Network 2.0 revolution -- but this time around, we're 20 years ahead R&D-wise and our computers are orders of magnitude faster.  And what's funny is that the guys who were championing such techniques in the early 90s were the same guys we were laughing at in the late 90s (because clearly convex methods were superior to the magical NN learning-rate knobs). I guess they really had the last laugh, because eventually these relentless neural network gurus became the same guys we now all look up to.  Geoffrey Hinton, Yann LeCun, Andrew Ng, and Yoshua Bengio are the 4 Titans of Deep Learning.  By now, just about everybody has jumped ship to become a champion of Deep Learning.

But with Google, Facebook, Baidu, and a multitude of little startups riding the Deep Learning wave, who will rise to the top as the master of artificial intelligence?


How do today's deep learning systems resemble the recognition systems of yesteryear?
Multiscale convolutional neural networks aren't that much different from the feature-based systems of the past.  The first-level neurons in deep learning systems learn to utilize gradients in a way that is similar to hand-crafted features such as SIFT and HOG.  Objects used to be found in a sliding-window fashion, but now it is easier and sexier to think of this operation as convolving an image with a filter. Some of the best detection systems used to use multiple linear SVMs, combined in some ad-hoc way, and now we are essentially using even more of such linear decision boundaries.  Deep learning systems can be thought of as multiple stages of applying linear operators and piping them through a non-linear activation function; in that sense, deep learning is more similar to a clever combination of linear SVMs than to a memory-ish kernel-based learning system.
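To see why the "sliding window equals convolution" view holds, here is a toy sketch (with random arrays standing in for a feature channel and a learned linear filter): scoring the template at every location is just a cross-correlation, which is exactly what a convolutional layer computes, up to a filter flip.

import numpy as np
from scipy.signal import correlate2d

feature_map = np.random.rand(64, 64)     # stand-in for one feature channel
template = np.random.rand(8, 8)          # stand-in for a learned linear filter

# Dense sliding-window scores: one dot product per window position.
scores = correlate2d(feature_map, template, mode="valid")
best = np.unravel_index(scores.argmax(), scores.shape)
print("best window at", best)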

Features these days aren't engineered by hand.  However, architectures of deep systems are still being designed manually -- and it looks like the experts are the best at this task.  The operations on the inside of both classic and modern recognition systems are still very much the same.  You still need to be clever to play in the game, but now you need a big computer. There's still a lot of room for improvement, so I encourage all of you to be creative in your research.

Research-wise, it never hurts to know where we have been before so that we can better plan for our journey ahead.  I hope you enjoyed this brief history lesson and the next time you look for insights in your research, don't be afraid to look back.


Some Computer Vision datasets:
Caltech-101 Dataset
ImageNet Dataset

To learn about the people mentioned in this article:
Kristen Grauman (creator of Pyramid Match Kernel, Prof at Univ of Texas)
Bill Triggs (co-creator of HOG, Researcher at INRIA)
Navneet Dalal (co-creator of HOG, now at Google)
Yann LeCun (one of the Titans of Deep Learning, at NYU and Facebook)
Geoffrey Hinton (one of the Titans of Deep Learning, at Univ of Toronto and Google)
Andrew Ng (leading the Deep Learning effort at Baidu, Prof at Stanford)
Yoshua Bengio (one of the Titans of Deep Learning, Prof at U Montreal)
Deva Ramanan (one of the creators of DPM, Prof at UC Irvine)
Pedro Felzenszwalb (one of the creators of DPM, Prof at Brown)
Fei-Fei Li (Caltech-101 and ImageNet, Prof at Stanford)
Josef Sivic (Video Google and Visual Words, Researcher at INRIA/ENS)
Andrew Zisserman (Geometry-based methods in vision, Prof at Oxford)
Andrew E. Johnson (SPIN Images creator, Researcher at JPL)
Martial Hebert (Geometry-based methods in vision, Prof at CMU)





Sunday, October 26, 2014

VMX is ready

I haven't posted anything here in the last few months, so let me give you guys a brief update. VMX has matured since the Prototype stage last year and the vision.ai team has already started circulating some beta versions of our software.

For those of you who don't remember, last year I decided to leave my super-academic life at MIT and go the startup route, focusing on vision, learning, and automation.  Our goal is to make building and deploying vision applications as easy as pie. We want to be the Heroku of computer vision.  Personally, I've always wanted to expose the magic of vision to a broader audience.  I don't know if the robots of the future are going to have two legs, four arms, or will forever be airborne -- but I can tell you that these creatures are going to have to perceive the world around them. 2014 is not a bad time to be a vision company.

VMX, the suite of vision and automation tools which we showcased last year in our Kickstarter campaign, is going live very soon.  VMX will be vision.ai's first product.  While VMX doesn't do everything vision-related (there's OpenCV for that), it makes training visual object detectors really easy.  Whether you're just starting out with vision or AI, have a killer vision-app idea, or want to automate more things in your home, you're gonna want to experience VMX yourself.



We will be providing a native installer for Mac OS X as well as a single-command installer for Linux machines based on Docker. VMX will run on your machine without an internet connection (the download plus all dependencies plus all necessary pre-trained files is approximately 2GB, and an activation license will cost between $100 and $1000).  The VMX App Builder runs in your browser, is built in AngularJS, and our REST API will allow you to write your own scripts/apps in any language you like.  We even have lots of command line examples if you're a curl-y kind of guy/gal. If there's sufficient demand, we'll work on a native Windows installer.

We have been letting some of our close friends and colleagues beta-test our software, and we're confident you're going to love it.  If you would like to beta-test our software, please sign up on the vision.ai mailing list and send us a beta-key request.  We have a limited number of beta-testing keys, so I'm sorry if we don't get back to you.  If you want a hands-on demo by one of the VMX creators, we are more than happy to take a hacking break and show off some VMX magic.  We can be found in Boston, MA and/or Burlington, VT.  If you're thinking of competing in a Hackathon near one of our offices, drop us a line and we'll try to send a vision.ai jedi your way.

Geoff has been championing Docker for the last year, and he's done amazing things Dockerizing our build pipeline. Meanwhile, I refactored the vision engine API using some ideas I picked up from Haskell and made considerable performance tweaks to the underlying learning algorithm.  I also spent a few months toying with different deep network representations and modernized the internal representation so I can find another deep learning guru to help us out with R&D in 2015.

4 VMXserver processes running on a MacBook Pro

We're going to release plenty of pre-trained models plus all the tools and video tutorials you'll need to create your own models from scratch.  

We will be offering a $100 personal license and a $1000 professional license of VMX.  Beta testers get a personal license in return for helping find installation bugs. Internally, we are at version 0.1.3 of VMX, and once we attain 90%+ code coverage we will have VMX 1.0 sometime in early 2015.  We typically release stable versions every month and bleeding-edge development builds every week.

The future of vision.ai 

In the upcoming months, we'll be perfecting our cloud-based deployment platform, so if you're interested in building on top of our vision.ai infrastructure or want to have fun running some massively parallel vision computations with us, just shoot us an email.



Wednesday, May 23, 2012

Why your vision lab needs a reading group

I have a certain attitude when it comes to computer vision research -- don't do it in isolation. Reading vision papers on your own is not enough.  Learning how your peers analyze computer vision ideas will only strengthen your own understanding of the field and help you become a more critical thinker.  And that is why at places like CMU and MIT we have computer vision reading groups.  The computer vision reading group at CMU (also known as MISC-read to the CMU vision hackers) has a long tradition, and Martial Hebert has made sure it is a strong part of the CMU vision culture.  Other ex-CMU hackers such as Sanjiv Kumar have carried the vision reading group tradition on to places such as Google Research in NY (correct me if this is no longer the case).  I have continued the reading group tradition at MIT (where I'm currently a postdoc) because I was surprised there wasn't one already!  In reality, we spend so much time talking about papers in an informal setting that I felt it was a shame not to do so in a more organized fashion.
My personal philosophy is that, as a vision researcher, the path towards creating novel, long-lasting ideas goes through learning how others think about the field.  There's a lot of value in being able to analyze, criticize, and re-synthesize other researchers' ideas.  Believe me when I say that a lot of new vision papers come out of top-tier vision conferences every year.  You should be reading them!  But not just reading them -- also criticizing them among your peers.  Because once you learn to criticize others' ideas, you will become better at promulgating your own.  Do not equate criticism with nasty words for the sake of being nasty -- good criticism stems from a keen understanding of what must be done in science to convince a broad audience of your ideas.

In case you want to start your own computer vision research group, I've collected some tips, tricks, and advice:

1. You don't need faculty.  If you can't find a seasoned vision veteran to help you organize the event, do not worry.  You just need 3+ people interested in vision and the motivation to maintain weekly meetings.  Who cares if you don't understand every detail of every paper!  Nobody besides the authors will ever understand every detail.

2. Be fearless.  Ask dumb questions.  Alyosha Efros taught me that if you're reading a paper or listening to a presentation and you don't understand something, then there's a good chance you're not the only one in the audience with the same question.  Sometimes younger PhD students are afraid of "asking a dumb question" in front of an audience.  But if you love knowledge, then it is your duty to ask.  Silence will not get you far.  Be bold, be curious, and grow wise.

3. Choose your own papers to present.  Do not present papers that others want you to present -- that is better left for a seminar course led by a faculty member.  In a reading group it is very important that you care about the problems you will be discussing with your peers.  If you keep up with this trend then when it comes to "paper writing time" you should be up to date on many relevant papers in your field and you will know about your other lab mates' research interests.

4. It is better to show a paper PDF up on a projector than cancel a meeting.  Even if everybody is busy, and the presenter didn't have time to create slides, it is important to keep the momentum going.

5. After a major conference, have all of the people who attended the conference present their "top K paper."  The week after CVPR it will be valuable to have such a massive vision brain dump onto your peers because it is unlikely that everybody got to attend. 

6. Book a room every week and try to have the meeting at the same time and place.  Have either the presenter or the reading group organizer send out an announcement with the paper they will be presenting ahead of time.  At MIT we share a Google doc with information about interesting papers, and the upcoming presenter usually chooses the paper one week in advance so that the following week's presenter doesn't choose the same paper.  If somebody has already presented your paper, don't present it a second time!  Choose another paper.  cvpapers.com is a great resource for finding upcoming papers.

At CMU, there is a long rotating schedule which includes every vision student and faculty member.  Once it is your time to present, you can only get off the hook if you swap your slot with somebody else.  Being on a schedule months in advance means you'll have lots of time to prepare your slides.  At MIT, we are currently following the object recognition / scene understanding / object detection theme where we (Prof. Torralba, his students, his postdocs, his visiting students, etc) choose a paper highly relevant to our interests.  By keeping such a focus, we can really jump into the relevant details without having to explain fundamental concepts such as SVMs, features, etc.  However, at CMU the reading group is much broader because on the queue are students/profs interested in all aspects of vision and related fields such as graphics, illumination, geometry, learning, etc.






Wednesday, October 26, 2011

Google Internship in Vision/ML

Disclaimer: the following post is cross-posted from Yaroslav's "Machine Learning, etc" blog. Since I always rave about my experiences at Google as an intern (did it twice!), I thought some of my fellow readers would find this information useful.  If you are a vision PhD student at CMU or MIT, feel free to ask me more about life at Google.  If you have questions regarding the following internship offer, you'll have to ask Yaroslav.

Original post at: http://yaroslavvb.blogspot.com/2011/10/google-internship-in-visionml.html


My group has intern openings for winter and summer. Winter may be too late (but if you really want winter, ping me and I'll find out feasibility). We use OCR for Google Books, frames from YouTube videos, spam images, unreadable PDFs encountered by the crawler, images from Google's StreetView cameras, Android, and a few other areas. Recognizing individual character candidates is a key step in an OCR system, and one that machines are not very good at. Even with zero context, humans are better. This shall not stand!

For example, when I showed the picture below to my Taiwanese coworker, he immediately said that these were multiple instances of the Chinese "one".



Here are 4 of those images close-up. Classical OCR approaches have trouble with these characters.



This is a common problem for high-noise domains like camera pictures and digital text rasterized at low resolution. Some results suggest that techniques from Machine Vision can help.

For low-noise domains like Google Books and broken PDF indexing, shortcomings of traditional OCR systems are due to:
1) Large number of classes (100k letters in Unicode 6.0)
2) Non-trivial variation within classes
Example of "non-trivial variation"


I found over 100k distinct instances of the digital letter 'A' from just one day's crawl worth of documents from the web. Some more examples are here

Chances are that the ideas for a human-level classifier are out there. They just haven't been implemented and tested in realistic conditions. We need someone with an ML/Vision background to come to Google and implement a great character classifier.

You'd have a large impact if your ideas become part of Tesseract. Through books alone, your code will be run on books from 42 libraries. And since Tesseract is open-source, you'd be contributing to the main OCR effort in the open-source community.

You will get a ton of data, resources, and smart people around you. It's a very low-bureaucracy place. You could run Matlab code on 10k cores if you really wanted, and I know someone who has launched 200k-core jobs for a personal project. The infrastructure also makes things easier. Google's MapReduce can sort a petabyte of data (10 trillion strings) with 8000 machines in just 30 mins. Some of the work in our team used features coming from a distributed deep belief infrastructure.


In order to get an internship position, you must pass a general technical screen that I have no control over. If you are interested in more details, you can contact me directly.  -- Yaroslav

(the link to apply is usually here, but now it's down, will update when it's fixed)

Friday, September 09, 2011

My first week at MIT: What is intelligence?

In case anybody hasn't heard the news, I am no longer a PhD student at CMU.  After I handed in my camera-ready dissertation, it didn't take long for my CMU advisor to promote me from his 'current students' to 'former students' list on his webpage.  Even though I doubt there is anyplace in the world which can rival CMU when it comes to computer vision,  I've decided to give MIT a shot.  I had wanted to come to MIT for a long time, but 6 years ago I decided to choose CMU's RI over MIT's CSAIL for my computer vision PhD.  Life is funny because the paths we take in life aren't dead-ends -- I'm glad I had a second chance to come to MIT.


In case you haven't heard, MIT is a little tech school somewhere in Boston.  Lots of undergrads can be caught wearing math T-shirts, and posters like the following can be found on the walls of MIT:


A cool (undergrad targeted) poster I saw at MIT



As of last week I'm officially a postdoc in CSAIL and I'll be working with Antonio Torralba and Aude Oliva. I've been closely following both Antonio's and Aude's work over the last several years, and getting to work with these giants of vision will surely be a treat.  In case you don't know what a postdoc is, it is a generic term used to describe post-PhD researchers with generally short-term (1-3 year) appointments.  People generally use the term Postdoctoral Fellow or Postdoctoral Associate to describe their position in a university. I guess 3 years working on vision as an undergrad and 6 years of working on vision as a grad student just wasn't enough for me...


I've been getting adjusted to my daily commute through scenic Boston, learning about all the cool vision projects in the lab, as well as meeting all the PhD students working with Antonio. Today was the first day of a course which I'm sitting in on, titled "What is intelligence?".  When I saw a course offered by two computer vision titans (Shimon Ullman and Tomaso Poggio), I couldn't resist.  Here is the information below:

What is intelligence?



Class Times: Friday 11:00-2:00 pm
Units: 3-0-9
Location: 46-5193 (NOTE: we had to choose a bigger room)
Instructors: Shimon Ullman and Tomaso Poggio

The class was packed -- we had to relocate to a bigger room.  Much of today's lecture was given by Lorenzo Rosasco. Lorenzo is the Team Leader of IIT@MIT. Here is a blurb from IIT@MIT's website describing what this 'center' is all about:

The IIT@MIT lab was founded from an agreement between the Massachusetts Institute of Technology (MIT) and the Istituto Italiano di Tecnologia (IIT). The scientific objective is to develop novel learning and perception technologies – algorithms for learning, especially in the visual perception domain, that are inspired by the neuroscience of sensory systems and are developed within the rapidly growing theory of computational learning. The ultimate goal of this research is to design artificial systems that mimic the remarkable ability of the primate brain to learn from experience and to interpret visual scenes.


Another cool class offered this semester at MIT is Antonio Torralba's Grounding Object Recognition and Scene Understanding.