Tuesday, December 13, 2011

learning to "borrow" examples for object detection. Lim et al, NIPS 2011

Let's say you want to train a cat detector...  If you're anything like me, then you probably have a few labeled cats (~100), as well as a source of non-cat images (~1000).  So what do you do when you can't get any more labeled cats?  (Maybe Amazon's Mechanical Turk service was shut down by the feds, you've got a paper deadline in 48 hours, and money can't get you out of this dilemma.)

Answer: 
1) Realize that there are some labeled dogs/cows/sheep in your dataset!
2) Transform some of the dogs/cows/sheep in your dataset to make them look more like cats. Maybe some dogs are already sufficiently similar to cats! (see cheezburger.com image below)
3) Use a subset of those transformed dogs/cows/sheep examples as additional positives in your cat detector!

Some dogs just look like cats! (and vice-versa)


Using my own internal language, I view this phenomenon as "exemplar theft."  But not the kind of theft which sends you to prison, 'tis the kind of theft which gives you best-paper prizes at your local conference.

Note that this was the answer provided by the vision hackers at MIT in their most recent paper, "Transfer Learning by Borrowing Examples for Multiclass Object Detection," which was just presented at NIPS 2011, this year's big machine-learning conference. See the illustration from the paper below, which depicts this type of example borrowing/sharing for some objects in the SUN09 dataset.


The paper empirically demonstrates that instead of doing transfer learning (closely related to multi-task learning) the typical way (regularizing weight vectors towards each other), it is beneficial to simply borrow a subset of (transformed) examples from a related class.  Of course the problem is that we do not know a priori which categories to borrow from, nor which instances from those categories will give us a gain in object detection performance.  The goal of the algorithm is to learn both which categories to borrow from and which examples to borrow.  Not all dogs will help the cat detector.
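To make the borrowing idea concrete, here is a minimal sketch in Python (my own simplification, not the paper's actual optimization, which learns soft per-example borrowing weights): score candidate examples from a related class with the current cat detector, keep the ones that already look cat-like, and retrain with them as extra positives.  All feature matrices below are hypothetical placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical feature matrices (e.g., HOG vectors); shapes are placeholders.
cat_pos = np.random.randn(100, 512)      # the few labeled cats
negatives = np.random.randn(1000, 512)   # non-cat windows
dog_pos = np.random.randn(300, 512)      # labeled dogs (candidate donors)

def train_detector(pos, neg):
    X = np.vstack([pos, neg])
    y = np.hstack([np.ones(len(pos)), -np.ones(len(neg))])
    return LinearSVC(C=1.0).fit(X, y)

# 1) Train an initial cat detector from the scarce cat positives.
detector = train_detector(cat_pos, negatives)

# 2) Score the dog examples and "borrow" only the ones that already
#    look cat-like under the current model (a hard threshold here).
scores = detector.decision_function(dog_pos)
borrowed = dog_pos[scores > 0.0]

# 3) Retrain with the borrowed examples as additional positives.
detector = train_detector(np.vstack([cat_pos, borrowed]), negatives)
print("borrowed %d of %d dog examples" % (len(borrowed), len(dog_pos)))
```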

Here are some examples of popular object categories, the categories from which examples are borrowed, and the categories from which examples are shared once we allow transformations to happen.  Notice the improvement in AP (the higher the average precision the better) when you allow sharing.



They also looked at what happens if you want to improve a single-category badass detector on one particular dataset, such as the PASCAL VOC.  Note that these days just about everybody is using the one-and-only "badass detector" and trying to beat it at its own game.  These are the different ways you'll hear people talk about the Latent-SVM-based Deformable Part Model baseline: "badass detector" = "state-of-the-art detector" = "Felzenszwalb et al. detector" = "Pedro's detector" = "Deva's detector" = "Pedro/Deva detector" = "LDPM detector" = "DPM detector".

Even if you only care about your favourite dataset, such as PASCAL VOC, you're probably willing to use additional positive data points from another dataset.  In their NIPS paper, the MIT hackers show that simply concatenating datasets is inferior to their clever example-borrowing algorithm (mathematical details are found in the paper, but feel free to ask me detailed questions in the comments).  In the figure below, the top row shows cars from one dataset (SUN09), the middle row shows PASCAL VOC 2007 cars, and the bottom row shows which examples the SUN09-car detector wants to borrow from PASCAL VOC.

Here is the cross-dataset generalization performance on the SUN09/PASCAL duo.  These results were inspired by the dataset bias work of Torralba and Efros.



In case you're interested, here is the full citation for this excellent NIPS2011 paper:

Joseph J. Lim, Ruslan Salakhutdinov, and Antonio Torralba. "Transfer Learning by Borrowing Examples for Multiclass Object Detection," in NIPS 2011. [pdf]




To get a better understanding of Lim et al.'s paper, it is worthwhile going back in time to CVPR 2011 and taking a quick look at the following paper, also from MIT:

Ruslan Salakhutdinov, Antonio Torralba, Josh Tenenbaum. "Learning to Share Visual Appearance for Multiclass Object Detection," in CVPR 2011. [pdf]

Of course, these authors need no introduction (they are all professors at big-time institutions). Ruslan just recently became a professor and is now back on home turf (where he got his PhD) in Toronto, where he is likely to become the next Hinton.  In my opinion, this "Learning to Share" paper was one of the best papers of CVPR 2011.  In this paper they introduced the idea of sharing across rigid classifier templates and, more importantly, learning a tree to organize hundreds of object categories.  The tree defines how the sharing is supposed to happen.  The root node is global and shared across all categories, the mid-level nodes can be interpreted as super-categories (e.g., animal, vehicle), and the leaves are the actual object categories (e.g., dog, chair, person, truck).

The coolest thing about the paper is that they use a CRP (Chinese restaurant process) to learn a tree without having to specify the number of super-categories!
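For intuition, here is a tiny Python sketch of the CRP prior itself (just the prior, not the paper's full hierarchical model): each incoming category joins an existing super-category with probability proportional to its current size, or starts a new one with probability proportional to a concentration parameter, so the number of super-categories is never fixed in advance.

```python
import random

def crp_assignments(num_items, alpha=1.0, seed=0):
    """Sample a partition of items (categories) into tables (super-categories)."""
    rng = random.Random(seed)
    tables = []        # tables[k] = number of items seated at table k
    assignments = []
    for n in range(num_items):
        # Existing table k is chosen with weight tables[k];
        # a brand-new table is chosen with weight alpha.
        weights = tables + [alpha]
        r = rng.uniform(0, sum(weights))
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(tables):
            tables.append(1)   # open a new super-category
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments

# e.g., partition 200 object categories into an unknown number of super-categories
print(crp_assignments(200, alpha=2.0))
```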

Finally, we can see some learned weights for three distinct object categories: truck, van, and bucket.  Please see the paper if you want to learn more about sharing -- the clarity of Ruslan's paper is exceptional.




In conclusion, it is pretty clear everybody wants some sort of visual memex.  (It is easy to think of the visual memex as a graph where the nodes are individual instances and the edges are relationships between these entities.)  Sharing, borrowing, multi-task regularization, exemplar-SVMs, and a host of other approaches are hinting at the breakdown of the traditional category-based way of approaching the problem of object recognition.  However, our machine learning tools were designed for supervised learning with explicit class information.  So what we, the researchers, do is try to break down those classical tools so that we can more effectively exploit the blurry line between not-so-different object categories.  At the end of the day, rigid categories can only get us so far.  Intelligence requires interpretation at multiple and potentially disparate levels.  When it comes to intelligence, the world is not black and white; there are many flavours of meaningful image interpretation.

Tuesday, December 06, 2011

Graphics meets Big Data meets Machine Learning

We've all played Where's Waldo as children, and at least for me it was quite a fun game.  So today let's play an image-based Big Data version of Where's Waldo.  I will give you a picture, and you have to find it in a large collection of images!  This is a form of image retrieval, and this particular formulation is also commonly called "image matching."


The only catch is that you are given just one picture, and I am free to replace the picture with a painting or a sketch.  Any two-dimensional pattern is a valid query image, but the key thing to note is that there is only a single input image.  Life would be awesome if Google's Picasa had this feature built in!


The classical way of solving this problem is via a brute-force nearest-neighbor algorithm -- an algorithm which won't match pixel patterns directly, but will instead use a state-of-the-art image descriptor such as GIST for comparison.  Back in 2007, at SIGGRAPH, James Hays and Alexei Efros showed this to work quite well once you have a very large database of images!  But the reason why the database had to be so large is that a naive nearest-neighbor algorithm is actually quite dumb.  The descriptor might be cleverer than matching raw pixel intensities, but for a machine, an image is nothing but a matrix of numbers, and nobody told the machine which patterns in the matrix are meaningful and which ones aren't.  In short, the brute-force algorithm works if there are similar enough images such that all parts of the input image will match a retrieved image.  But ideally we would like the algorithm to get better matches by automatically figuring out which parts of the query image are meaningful (e.g., the fountain in the painting) and which parts aren't (e.g., the reflections in the water).
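For reference, the brute-force baseline really is this simple.  Here is a sketch in Python using a crude tiny-image descriptor as a stand-in for GIST (any global descriptor plugs into the same slot):

```python
import numpy as np
from skimage.transform import resize

def tiny_descriptor(image, size=(32, 32)):
    """Downsample, flatten, and L2-normalize: a crude stand-in for GIST."""
    d = resize(image, size, anti_aliasing=True).ravel()
    return d / (np.linalg.norm(d) + 1e-8)

def retrieve(query_image, database_images, k=5):
    """Return indices of the k nearest database images to the query."""
    q = tiny_descriptor(query_image)
    D = np.stack([tiny_descriptor(im) for im in database_images])
    dists = np.linalg.norm(D - q, axis=1)   # brute-force comparison to everything
    return np.argsort(dists)[:k]
```

Nothing in this loop knows which pixels of the query actually matter, which is exactly the weakness described above.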

A modern approach to this issue is to collect a large set of related "positive images" and a large set of unrelated "negative images," and then train a powerful classifier which can hopefully figure out the meaningful bits of the image.  But this approach has two problems.  First, with a single input image it is not clear whether standard machine learning tools will have a chance of learning anything meaningful.  The second, significantly worse, issue is: without a category label or tag, how are we supposed to create a negative set?!?  Exemplar-SVMs to the rescue!  We can use a large collection of images from the target domain (the domain we want to find matches from) as the negative set -- as long as the "negative set" contains only a small fraction of potentially related images, learning a linear SVM with a single positive still works.
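In code, the core trick fits in a few lines.  Below is a minimal sketch of the idea (a simplification, not our full pipeline -- the real system adds hard-negative mining and calibration): train a linear SVM on a single positive descriptor against thousands of negatives, and up-weight the lone positive so it isn't drowned out.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_exemplar_svm(positive_feature, negative_features, C_pos=0.5, C_neg=0.01):
    """Train a linear SVM from one positive and many negatives.
    The heavier penalty on the positive class keeps it from being ignored."""
    X = np.vstack([positive_feature[None, :], negative_features])
    y = np.hstack([[1], -np.ones(len(negative_features))])
    # class_weight scales the misclassification penalty per class
    return LinearSVC(C=1.0, class_weight={1: C_pos, -1: C_neg}).fit(X, y)

# Hypothetical usage: one query descriptor vs. a pile of unrelated images
query = np.random.randn(512)
negatives = np.random.randn(5000, 512)
esvm = train_exemplar_svm(query, negatives)
scores = esvm.decision_function(negatives)   # rank any candidate windows by this
```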




Here is an excerpt from a Techcrunch article which summarizes the project concisely:

"Instead of comparing a given image head to head with other images and trying to determine a degree of similarity, they turned the problem around. They compared the target image with a great number of random images and recorded the ways in which it differed the most from them. If another image differs in similar ways, chances are it’s similar to the first image. " -- Techcrunch


Abhinav Shrivastava, Tomasz Malisiewicz, Abhinav Gupta, Alexei A. Efros. Data-driven Visual Similarity for Cross-domain Image Matching. In SIGGRAPH ASIA, December 2011. [Project Page]



Here is a short listing of some articles which mention our research (thanks Abhinav!).




Monday, December 05, 2011

An accidental face detector

Disclaimer #1: I don't specialize in faces.  When it comes to learning, I like my objectives to be convex.  When it comes to hacking on vision systems, I like to tackle entry-level object categories.

Fun fact #1: Faces are probably the easiest objects in the world for a machine to localize/detect/recognize.

Note #1: I supplied the images, my algorithm supplied the red boxes.

Note #2: Sorry to all my friends who failed to get detected by my accidental face detector! (see below)

So I was hackplaying with some of my PhD thesis code over Thanksgiving, and I accidentally made a face detector.  oops!  I immediately fired up my screenshot-capture tool and ran my code on my Mac desktop while browsing Google Images and Facebook.  It seems to work pretty well on real faces as well as sketches/paintings of faces (see below)!  I even caught two Berkeleyites (an Alyosha and a Jianbo), but you gotta find them for yourself.  The detector is definitely tuned to frontal faces, but runs pretty fast and produces few false positives.  Not too shabby for some midnight hackerdom.










Yes, I'm doing dense multiscale sliding windows here.  Yes, I'm HoGGing the hell outta these images. Yes, I'm using a single frontal-face tuned template.  And yes, I only used faces of myself to train this accidental face detector.
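If you have never written one, the skeleton of such a detector is short.  Here is a rough Python sketch (not my actual thesis code) of single-template, multiscale, sliding-window scoring with HOG features; the template weights w and bias b are assumed to come from a trained linear SVM, and the image is assumed to be grayscale.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import rescale

def detect(image, w, b, win=(96, 96), step=16,
           scales=(1.0, 0.75, 0.5), thresh=0.0):
    """Score one linear HOG template over a dense multiscale grid of windows.
    Returns (score, x, y, scale) tuples above the threshold."""
    detections = []
    for s in scales:
        img = rescale(image, s, anti_aliasing=True)
        H, W = img.shape[:2]
        for y in range(0, H - win[0] + 1, step):
            for x in range(0, W - win[1] + 1, step):
                patch = img[y:y + win[0], x:x + win[1]]
                f = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                        cells_per_block=(2, 2))
                score = float(np.dot(w, f) + b)
                if score > thresh:
                    # map the window back to original-image coordinates
                    detections.append((score, int(x / s), int(y / s), s))
    return sorted(detections, reverse=True)
```

A real detector would compute the HOG pyramid once per scale and correlate the template against it (much faster than re-extracting features per window), and would finish with non-maximum suppression.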

Note: If I've used one of your pictures without permission, and you would like a link back to your home on the interwebs, please leave a comment indicating the image and link to original.



Friday, December 02, 2011

Google Scholar, My Citations, a new paradigm for finding great Computer Vision research papers

I have been finding great computer vision research papers by using Google Scholar for the past 2+ years.  My recipe is straightforward and has two key ingredients. First, by finding new papers that cite one of my published papers, I automatically get to read papers which will be relevant to my own research interests.  The best bit is that by using Google Scholar, I'm not limiting my search to a single conference -- Google finds papers from the raw web.

Second, I have a short list of superstar vision researchers (Jitendra Malik, among others) and I basically read anything and everything these gurus publish.  Regularly visiting academic homepages is the best way to do this, but Google Scholar also lets me search by name.  In addition, nobody lists their papers' citation counts on their homepage.  This means if I visit a researcher's personal website, I have to decide what to read based on (title, co-authors, publication venue) alone.  But highly-cited papers are likely to be the more important ones to read first.  I believe this is a good rule of thumb, and very important if you are new to the field.

I am really glad that Google finally let researchers make public profiles to view their papers and see their citations, etc.  See the Google Scholar blog for more information.  I've been using statcounter to monitor my blog's visitors, and now I can use Google Scholar to monitor who is citing my research papers!  I'm not claiming that the only way for me to read one of your papers is for you to cite one of my papers, but believe me, even if we never met at a vision conference, if you cited one of my papers there's a good chance I already know about your research :-)  I would love to see Google Scholar Citations pages one day replace "my publications" sections on academic homepages...



My Citations screenshot



My only complaint with Google Scholar is that I can't seem to get it to recognize my two most recent papers.  I have these papers listed on my homepage, and so do my co-authors, but Google isn't picking them up!!!  I manually added them to my Google My Citations page, and using Google Scholar I was able to find at least one other paper which cites one of these two papers.

I read the inclusion guidelines, and I'm still baffled.  The PDFs are definitely over 5MB, but my older papers which were indexed by Google were also over 5MB.  Dear Google, are you seriously not indexing my recent papers because they are over 5MB?  It takes us, researchers, months of hard work to get our work out the door.  We see the sun rise for weeks straight when we are in deadline-mode, and the conferences/journals give us size limitations -- we work hard to make our stuff fit within these limits (something like 20MB per PDF).  And we, researchers, are crazy about Google and what it means for organizing the world's information -- naturally we are jumping on the Google Scholar bandwagon. I really hope there's some silly reason why I can't find my own papers using Google Scholar, but if I can't find my own work, that means others can't find my own work, and until I can be confident that Google Scholar is bug-free, I cannot give it my full recommendation.

Problematic papers for Google Scholar:

Abhinav Shrivastava, Tomasz Malisiewicz, Abhinav Gupta, Alexei A. Efros. Data-driven Visual Similarity for Cross-domain Image Matching. In SIGGRAPH ASIA, December 2011.

Tomasz Malisiewicz, Abhinav Gupta, Alexei A. Efros. Ensemble of Exemplar-SVMs for Object Detection and Beyond. In ICCV, November 2011.

If anybody has any suggestions (there's a chance I'm doing something wrong), or an explanation as to why my papers haven't been indexed, I would love to hear from you.

Wednesday, November 16, 2011

don't throw away old code: github-it!

My thesis experiments on Exemplar-SVMs (my PhD thesis link: Note, 33MB) would have taken approximately 20 CPU-years to finish.  But not on a fat CMU cluster!  Here is some simple code which helped make things possible in ~1 month of crunching on 200+ cores.  That scale of computation is not quite Google-scale computing, but it was an unforgettable experience as a CMU PhD student.  I've recently had to go back to the SSH / GNU Screen method of starting scripts at MIT, since we do not have torque/pbs there, but I definitely still use these scripts.  Fork it, use it, change it, hack it, improve it, break it, learn from it, etc.

https://github.com/quantombone/warp_scripts

I used these scripts to drive the experiments in my Exemplar-SVM framework (also on Github).
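The core trick behind scripts like these is embarrassingly simple.  Here is a Python sketch of the idea (my own illustration, not the actual contents of the repo): every worker, whether it lands on a cluster node or inside a GNU Screen session, scans a shared directory of task files and atomically claims the next unfinished one with a lock file, so you can keep adding workers until you run out of cores.

```python
import glob
import os
import subprocess

def claim_and_run(task_dir):
    """Each worker claims unfinished tasks via atomic lock-file creation."""
    for task in sorted(glob.glob(os.path.join(task_dir, '*.sh'))):
        lock = task + '.lock'
        try:
            # O_EXCL makes create-if-absent atomic, so two workers
            # cannot claim the same task (modulo flaky network filesystems).
            os.close(os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
        except FileExistsError:
            continue                      # somebody else already claimed it
        subprocess.call(['bash', task])   # run the actual experiment chunk
        open(task + '.done', 'w').close()

if __name__ == '__main__':
    claim_and_run('/path/to/shared/tasks')   # hypothetical shared directory
```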


The basic take-home message is "do not throw away old code" that you once found useful.  C'mon, ex-PhD students, I know you wrote a lot of code; you graduated and now you feel embarrassed to share it.  Who cares if you never had a chance to clean it up?  If the world never gets to see it, it will die a silent death from lack of use.  Just put it on Github and let others take a look.  Git is the world's best source control/versioning system.  Its distributed nature makes it perfect for large-scale collaboration.  Now with Github, sharing is super easy!  Sharing is caring.  Let's make the world a better place for hackerdom, one repository at a time.  I've met some great hackers at MIT, such as the great cvondrick, who is still teaching me how to branch like a champ.

Mathematicians share proofs.  Hackers share code.  Embrace technology, embrace Github.  If you ever want to hack with me, it is probably as important for you to know the basics of git as it is for you to be a master of linear algebra.

Additional Reading:
Distributed Version Control: The Future of History, an article about Git by some Kitware software engineers


Tuesday, November 08, 2011

scene recognition with part-based models at ICCV 2011


Today, I wanted to point everyone's attention to a super-cool paper from day 1 of this year's ICCV 2011 Conference.  Megha Pandey is the lead on this, and Lana Lazebnik (of spatial pyramid fame) is the seasoned vision community member supervising this research.  The idea is really simple (and simplicity is a plus!): train a latent deformable part-based model for scenes.  Some of the scene models look really cool, and I encourage everybody interested in scene recognition to take a look.  

A Part-based Scene Model


One of the reasons why I like this paper is that, just like our SIGGRAPH ASIA 2011 paper on cross-domain image matching, they are using HOG features to represent scenes and applying these models in a sliding-window fashion.  This is very different from the traditional image-to-feature-vector mapping used in systems based on the GIST descriptor.  These types of approaches allow the detection of a scene inside another image!  Framing issues are elegantly handled by allowing the model to slide.


Abstract: Weakly supervised discovery of common visual structure in highly variable, cluttered images is a key problem in recognition. We address this problem using deformable part-based models (DPM’s) with latent SVM training. These models have been introduced for fully supervised training of object detectors, but we demonstrate that they are also capable of more open-ended learning of latent structure for such tasks as scene recognition and weakly supervised object localization. For scene recognition, DPM’s can capture recurring visual elements and salient objects; in combination with standard global image features, they obtain state-of-the-art results on the MIT 67-category indoor scene dataset. For weakly supervised object localization, optimization over latent DPM parameters can discover the spatial extent of objects in cluttered training images without ground-truth bounding boxes. The resulting method outperforms a recent state-of-the-art weakly supervised object localization approach on the PASCAL-07 dataset.



Weakly Supervised Object Localization (see paper for details)


Saturday, November 05, 2011

the fun begins at ICCV 2011

All the cool vision kids are going, so why aren't you?
http://www.iccv2011.org/

This will be my first ICCV ever! and my first trip to Spain!  Seriously though, if you need to find me over the next week, come to Barcelona.  There are lots of great papers out this year and I'll be sure to write about the few which I find interesting (and haven't already blogged about).  If you want to learn more about the craziness behind Exemplar-SVMs, or just to say 'Hi', don't hesitate to find me walking around the conference.  I'll be there during all the workshop days too.

If anybody has a favourite ICCV 2011 paper they want me to look at and perhaps write something about (hardcore object recognition please -- I don't care about illumination models), please send me your requests (in the comments below).

Wednesday, October 26, 2011

Google Internship in Vision/ML

Disclaimer: the following post is cross-posted from Yaroslav's "Machine Learning, etc" blog. Since I always rave about my experiences at Google as an intern (did it twice!), I thought some of my fellow readers would find this information useful.  If you are a vision PhD student at CMU or MIT, feel free to ask me more about life at Google.  If you have questions regarding the following internship offer, you'll have to ask Yaroslav.

Original post at: http://yaroslavvb.blogspot.com/2011/10/google-internship-in-visionml.html


My group has intern openings for winter and summer. Winter may be too late (but if you really want winter, ping me and I'll find out feasibility). We use OCR for Google Books, frames from YouTube videos, spam images, unreadable PDFs encountered by the crawler, images from Google's StreetView cameras, Android and a few other areas. Recognizing individual character candidates is a key step in an OCR system. One that machines are not very good at. Even with 0 context, humans are better. This shall not stand!

For example, when I showed the picture below to my Taiwanese coworker he immediately said that these were multiple instances of the Chinese "one".



Here are 4 of those images close-up. Classical OCR approaches have trouble with these characters.



This is a common problem for high-noise domains like camera pictures and digital text rasterized at low resolution. Some results suggest that techniques from Machine Vision can help.

For low-noise domains like Google Books and broken PDF indexing, shortcomings of traditional OCR systems are due to
1) Large number of classes (100k letters in Unicode 6.0)
2) Non-trivial variation within classes
Example of "non-trivial variation"


I found over 100k distinct instances of digital letter 'A' from just one day's crawl worth of documents from the web. Some more examples are here

Chances are that the ideas for a human-level classifier are out there. They just haven't been implemented and tested in realistic conditions. We need someone with an ML/Vision background to come to Google and implement a great character classifier.

You'd have a large impact if your ideas become part of Tesseract. Through books alone, your code will be run on books from 42 libraries. And since Tesseract is open-source, you'd be contributing to the main OCR effort in the open-source community.

You will get a ton of data, resources and smart people around you. It's a very low-bureaucracy place. You could run Matlab code on 10k cores if you really wanted, and I know someone who has launched 200k-core jobs for a personal project. The infrastructure also makes things easier. Google's MapReduce can sort a petabyte of data (10 trillion strings) with 8000 machines in just 30 mins. Some of the work in our team used features coming from a distributed deep belief infrastructure.


In order to get an internship position, you must pass general technical screen that I have no control of. If you are interested in more details, you could contact me directly.  -- Yaroslav

(the link to apply is usually here, but now it's down, will update when it's fixed)

Tuesday, October 25, 2011

NIPS 2011 preview: person grammars and machines-in-the-loop for video annotation

Object Detection with Grammar Models
To appear in NIPS 2011 pdf

Today, I want to point out two upcoming NIPS papers which might be of interest to the Computer Vision community.  First, we have a person detection paper from the hackers who brought you Latent Discriminatively Trained Part-based Models (aka voc-release-3.1 and voc-release-4.0).  I personally don't care for grammars (I think exemplars are a much more data-driven and computation-friendly way of modeling visual concepts), but I think any paper with Pedro on the author list is really worth checking out.  Maybe after I digest all the details, I'll jump on the grammar bandwagon (but I doubt it).  Also of note, is the fact that Pedro Felzenszwalb has relocated to Brown University.



The second paper is by Carl Vondrick and Deva Ramanan (also of latent-SVM fame).  Carl is the author of vatic and a fellow vision@github hacker.  Carl, like myself, has joined Antonio Torralba's group at MIT this fall.  He just started his PhD, so you can only expect the quality of his work to increase without bound over the next ~5 years.  vatic is an online, interactive video annotation tool for computer vision research that crowdsources work to Amazon's Mechanical Turk.  Vatic makes it easy to build massive, affordable video data sets and can be deployed on a cloud.  Written in Python + C + Javascript, vatic is free and open-source software.  The video below showcases the power of vatic.



In this paper, Vondrick et al. use active learning to select the frames which require human annotation.  Rather than simply doing linear interpolation between frames, they are truly putting the "machine-in-the-loop." When doing large-scale video annotation, this approach can supposedly save you tens of thousands of dollars.
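To give a flavor of the frame-selection loop, here is a toy Python sketch (my paraphrase of the general active-learning idea, not the paper's actual cost function): linearly interpolate boxes between the frames a human has already labeled, and request annotation for the frame where a tracker and the interpolation disagree the most.

```python
import numpy as np

def next_frame_to_annotate(labeled, tracker_boxes):
    """labeled: {frame_index: [x1, y1, x2, y2]}; tracker_boxes: one box per frame.
    Returns the unlabeled frame where interpolation looks least trustworthy."""
    frames = sorted(labeled)
    n = len(tracker_boxes)
    interp = np.empty((n, 4))
    for i in range(n):   # piecewise-linear interpolation of the human annotations
        lo = max([f for f in frames if f <= i], default=frames[0])
        hi = min([f for f in frames if f >= i], default=frames[-1])
        t = 0.0 if hi == lo else (i - lo) / (hi - lo)
        interp[i] = (1 - t) * np.asarray(labeled[lo]) + t * np.asarray(labeled[hi])
    # disagreement between tracker and interpolation stands in for uncertainty
    err = np.abs(interp - np.asarray(tracker_boxes)).sum(axis=1)
    err[frames] = 0.0   # frames with human labels need no further work
    return int(np.argmax(err))

# Ask the worker to label whichever frame this returns, then repeat.
```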

Carl Vondrick and Deva Ramanan. "Video Annotation and Tracking with Active Learning." Neural Information Processing Systems (NIPS), Granada, Spain, December 2011. [paper] [slides]

Thursday, October 06, 2011

Kinect Object Datasets: Berkeley's B3DO, UW's RGB-D, and NYU's Depth Dataset

Why Kinect?
The Kinect, made by Microsoft, is starting to become quite a common item in Robotics and Computer Vision research.  While the Robotics community has been using the Kinect as a cheap laser sensor which can be used for obstacle avoidance, the vision community has been excited about using the 2.5D data associated with the Kinect for object detection and recognition.  The possibility of building object recognition systems which have access to pixel features as well as 2.5D features is truly exciting for the vision hacker community!

Berkeley's B3DO
First of all, I would like to mention that it looks like the Berkeley Vision Group jumped on the Kinect bandwagon.  But the data collection effort will be crowdsourced -- they need your help!  They need you to use your Kinect to capture your own home/office environments and upload it to their servers.  This way, a very large dataset will be collected, and we, the vision hackers, can use machine learning techniques to learn what sofas, desks, chairs, monitors, and paintings look like.  The Berkeley hackers have a paper on this at one of the ICCV 2011 workshops in Barcelona; here is the paper information:



A Category-Level 3-D Object Dataset: Putting the Kinect to Work
Allison Janoch, Sergey Karayev, Yangqing Jia, Jonathan T. Barron, Mario Fritz, Kate Saenko, Trevor Darrell
ICCV-W 2011
[pdf] [bibtex]


UW's RGB-D Object Dataset
On another note, if you want to use 3D for your own object recognition experiments then you might want to check out the following dataset: University of Washington's RGB-D Object Dataset.  With this dataset you'll be able to compare against UW's current state-of-the-art.




In this dataset you will find RGB+Kinect3D data for many household items taken from different views.  Here is the really cool paper which got me excited about the RGB-D Dataset:
A Scalable Tree-based Approach for Joint Object and Pose Recognition
Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox
In the Twenty-Fifth Conference on Artificial Intelligence (AAAI), August 2011.



NYU's Depth Dataset
I have to admit that I did not know about this dataset (created by Nathan Silberman of NYU) until after I blogged about the other two datasets.  Check out the NYU Depth Dataset homepage.  However, the internet is great, and only a few hours after I posted this short blog post, somebody let me know that I left out this really cool NYU dataset.  In fact, it looks like this particular dataset might be at the LabelMe level regarding dense object annotations, but with accompanying Kinect data.  Rob Fergus & Co strike again!


Nathan Silberman, Rob Fergus. Indoor Scene Segmentation using a Structured Light Sensor. To Appear: ICCV 2011 Workshop on 3D Representation and Recognition

Thursday, September 29, 2011

plenoptica theoretica: fields vs. particles

"All bodies together, and each by itself, give off to the surrounding air an infinite number of images which are all-pervading and each complete, each conveying the nature, colour and form of the body which produces it." --Leonardo da Vinci

Yesterday, Edward H. Adelson ("Ted Adelson") gave a lecture at MIT on the plenoptic function and its role in understanding (and unifying) early vision.  Ted has been at MIT for quite some time.  He is sometimes described as being (1/3 human vision, 1/3 computer vision, and 1/3 computer graphics) and was Bill Freeman's advisor.

What is the plenoptic function?
Etymology: Plenoptic comes from plenus+optic.
plenus: full, filled
optic: relating to eye or vision

Ted Adelson imagined a sort of unified field theory for vision -- instead of proposing a jungle of atoms such as edges, corners, and peaks, the plenoptic function offers a unifying principle under which color, texture, motion, etc. can all be viewed as gradients of the plenoptic function.  The plenoptic function is a complete representation which contains, implicitly, a description of every possible photograph that could be taken of a particular space-time chunk of the world.  Omniscience is to knowing as the plenoptic function is to seeing.
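Formally, Adelson and Bergen write the plenoptic function as the intensity of light observable from every viewing position, in every direction, at every wavelength and time:

\[
P = P(\theta, \phi, \lambda, t, V_x, V_y, V_z)
\]

where \((\theta, \phi)\) is the viewing direction, \(\lambda\) the wavelength, \(t\) time, and \((V_x, V_y, V_z)\) the position of the eye; every possible photograph is a sample of this seven-dimensional function.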

Ted remarked that if you asked him 20 years ago what he was working on in vision, you might have gotten a confusing answer.  "Do you work on texture, motion, stereo, or illumination?" you might ask.  "All of them.  Aren't they all the same thing?" he might reply.  Ted argues that vision scientists in the 80s and early 90s tried to cut up the world of vision into neat little "particles" and would develop theories with their favorite particle -- here the particles are early vision concepts such as color, texture, and motion.

In their seminal paper on the plenoptic function, The Plenoptic Function and the Elements of Early Vision, Adelson and Bergen state that "the elemental operations of early vision involve the measurement of local change along various directions within the plenoptic function."  As a theoretical device, the plenoptic function has left a long-standing impression on me.  I first came across Ted's ideas back in 2006 -- thanks to Alyosha Efros' course on vision.  Having just completed a BS in Physics, I was well aware of unified field theories in physics, and the plenoptic function seemed too cool to forget.

What the plenoptic function means to me
However, if the plenoptic function is the Maxwell's equations equivalent for early (low-level) vision, then what I'm ultimately after is the Schrodinger equation of late (high-level) vision.  In his lecture, Ted Adelson acknowledged that vision scientists have a sort of Atom Envy -- they envy the physicists who are able to understand the world in terms of a few fundamental, ontologically meaningful entities.  First of all, I like particles, but I have no a priori reason to be in the particle camp all of my life.  Secondly, the plenoptic function was all about early vision, but my research in vision is all about high-level vision such as object recognition.  I might be young and foolish, but the search for a "mind mechanics" has been a part of my research life (at least partially) since ~2003.  Right now, my best shot at an answer is that exemplars and associations are the basic building blocks of high-level vision -- but unlike the British Empiricists (the champions of associationism), I would argue that the atomic building blocks of associations are object instances, and not ideas such as "roundness" and "blueness".  Complex ideas are then the object categories which arise out of the interactions between these concrete elements of experience.

Conclusion
The Adelson and Bergen paper is a must read for anybody serious about vision.  While it might not offer much in terms of "what next" in vision research, it is nevertheless a useful construct in thinking about vision.  I get excited when it comes down to unifying principles and I wish there were more papers like this in vision, especially for high-level vision.

Wednesday, September 28, 2011

Kant's Intuitions, the intentional stance, and reverse-engineering the mind


“Thoughts without content are empty, intuitions without concepts are blind. The understanding can intuit nothing, the senses can think nothing. Only through their unison can knowledge arise.” -- Immanuel Kant

“We live in a world that is subjectively open. And we are designed by evolution to be "informavores", epistemically hungry seekers of information, in an endless quest to improve our purchase on the world, the better to make decisions about our subjectively open future.” -- Daniel Dennett

"For scientists studying how humans cometo understand their world, the central challenge is this: How do our minds get so much from so little? We build rich causal models,make strong generalizations, and construct powerful abstractions, whereas the input data are sparse, noisy, and ambiguous—in every way far too limited. A massive mismatch looms between the information coming in through our senses and the outputs of cognition." -- Josh Tenenbaum

Organizing by space (space, time, and physics)
There are two faculties of understanding which it is unlikely we have acquired from experience.  The first is that of understanding objects as extended bodies in a 3D space and thus occupying some volume.  I believe it is Kant who argued best against the hardcore British Empiricists, who proclaimed that experience is the sole originator of knowledge.  Experiences are the pen strokes which fill the Empiricist's tabula rasa.  Kant argued (against Hume) that the concept of a spatially extended object is not acquired from experience – the very notion of experience requires that we already possess the notion of an object in order to have a meaningful percept.  It is as if the Empiricists failed to acknowledge that to make strokes on a sheet of paper, we need to already have a pen.  Kant's intuitions are the pens of experience.  The requirement of having suitable intuitions for grouping percepts into experiences is what Kant described as a form of transcendental idealism.  "Objectness" is a faculty of human understanding, not something acquired from experience.  If you are a vision researcher, being aware of this can have drastic implications on your research programme.

It has also been argued that there are some primitive notions of object dynamics, aka folk-physics, which are possessed by very young children.  Given the uniformity of human experience (at least I have no ostensible reason to doubt that my colleague's experiences significantly differ from my own), and the diversity in our individual upbringing, it is also unlikely that folk-physics is learned from experience.  However, I don't want to make any strong claims regarding folk-physics.  I feel safe to say that Quantum Mechanics is another story -- it requires years of mathematics and thousands of hours of deliberate problem solving to grasp.

Organizing by mind (psychology, mind, and intent)
The second faculty of understanding, which can be found in many aspects of human intelligence, is that of understanding the world in terms of cognitive agents.  Humans have an amazing capability when it comes to attributing stuff with having a mind.  This way of thinking about the world is so common and uniform among children all over the world that the differences in their upbringing cannot be reconciled with the uniformity of their capability to project humanness onto objects.  Consider the following video (thanks to J. Tenenbaum's videos/lectures for pointing this out).


We cannot just view this video as triangles, dots, and lines.  Each one of us understands the story in terms of a narrative based on agents and their intent.  We are stimulated by the external world, we take as input sense-data, and the brain helps us make sense of it -- it turns the hodgepodge of data into experience.  But the brain is a mold; it conforms percepts to some shape defined by the mold.  These molds are the faculties of understanding which let us understand things; it is as if the faculties of understanding are basis vectors onto which we project all input sense data.  The data is weak and noisy, the priors are strong, and understanding is the result of their union.  An experience without a proper basis is blind, it is just a ball of percepts.  These faculties allow us to have experience.  The experiences, coupled with memory, allow us to obtain understanding – where understanding is the relationship between a given experience and past experiences, either in the form of direct associations between currently-experienced objects and previously-experienced objects, or rules abstracted away from previously-experienced objects being directly applied to the current sense data.

What I am talking about is what the philosopher Daniel Dennett refers to as the "intentional stance."  Given my background in AI and philosophy of mind, it is very likely that Dennett and I have had the same influences.  I like to juxtapose my ideas with those of classical philosophers such as Descartes, Locke, Kant, Wittgenstein and Pinker -- I'm not sure how Dennett motivates his philosophy, nor do I know against whose ideas he juxtaposes his own stance.



At MIT, J. Tenenbaum is pushing these ideas to the next level.  I only wish there was more perception in his work -- toy worlds just don't do it for me.  I want to build intelligent machines, and really cannot afford to sidestep the issue of perception.  Here is a great talk by Josh Tenenbaum on reverse engineering the mind from NIPS 2010. Video is on videolectures.net, just click the link.


Implications for Artificial Intelligence and Machine Vision
Following Josh Tenenbaum, I think that a criticism of classical machine learning is long overdue.  Machine Learning, as a field, has been spewing out hardcore empiricists.  "Let me download your features, my machine learning algorithm will take care of the rest," they say.  It is as if the glory is in the mathematics, which manipulates N-D vectors.  But I argue that intelligence isn't "in the calculus," it is in what the primitives of the calculus actually represent.  As an undergraduate I proclaimed, "I am not a mathematician, I am a physicist.  I care about the structure of the world, not the structure of proofs."  As a graduate student I proclaimed, "The glory isn't in the manipulation of vectors, the glory is in understanding the what/why of encoding information about the world into vectors.  I am a computer vision researcher, not a machine learning researcher."  That is why the view of the world as coming from K different classes is wrong – it is merely a convenient view if the statistician's toolbox is at your disposal.  It is all about structuring the input to match a researcher's high-level intuitions about the world.

Friday, September 09, 2011

My first week at MIT: What is intelligence?

In case anybody hasn't heard the news, I am no longer a PhD student at CMU.  After I handed in my camera-ready dissertation, it didn't take long for my CMU advisor to promote me from his 'current students' to 'former students' list on his webpage.  Even though I doubt there is anyplace in the world which can rival CMU when it comes to computer vision,  I've decided to give MIT a shot.  I had wanted to come to MIT for a long time, but 6 years ago I decided to choose CMU's RI over MIT's CSAIL for my computer vision PhD.  Life is funny because the paths we take in life aren't dead-ends -- I'm glad I had a second chance to come to MIT.


In case you haven't heard, MIT is a little tech school somewhere in Boston.  Lots of undergrads can be caught wearing math T-shirts, and posters like the following can be found on the walls of MIT:


A cool (undergrad targeted) poster I saw at MIT



As of last week I'm officially a postdoc in CSAIL and I'll be working with Antonio Torralba and Aude Oliva.  I've been closely following both Antonio's and Aude's work over the last several years, and getting to work with these giants of vision will surely be a treat.  In case you don't know what a postdoc is, it is a generic term used to describe post-PhD researchers with generally short-term (1-3 year) appointments.  People generally use the term Postdoctoral Fellow or Postdoctoral Associate to describe their position in a university.  I guess 3 years working on vision as an undergrad and 6 years of working on vision as a grad student just wasn't enough for me...


I've been getting adjusted to my daily commute through scenic Boston, learning about all the cool vision projects in the lab, as well as meeting all the PhD students working with Antonio.  Today was the first day of a course which I'm sitting in on, titled "What is intelligence?".  When I saw a course offered by two computer vision titans (Shimon Ullman and Tomaso Poggio), I couldn't resist.  Here is the information below:

What is intelligence?



Class Times: Friday 11:00-2:00 pm
Units: 3-0-9
Location: 46-5193 (NOTE: we had to choose a bigger room)
Instructors: Shimon Ullman and Tomaso Poggio

The class was packed -- we had to relocate to a bigger room.  Much of today's lecture was given by Lorenzo Rosasco.  Lorenzo is the Team Leader of IIT@MIT.  Here is a blurb from IIT@MIT's website describing what this 'center' is all about:

The IIT@MIT lab was founded from an agreement between the Massachusetts Institute of Technology (MIT) and the Istituto Italiano di Tecnologia (IIT). The scientific objective is to develop novel learning and perception technologies – algorithms for learning, especially in the visual perception domain, that are inspired by the neuroscience of sensory systems and are developed within the rapidly growing theory of computational learning. The ultimate goal of this research is to design artificial systems that mimic the remarkable ability of the primate brain to learn from experience and to interpret visual scenes.


Another cool class offered this semester at MIT is Antonio Torralba's Grounding Object Recognition and Scene Understanding.


Wednesday, August 24, 2011

The vision hacker culture at Google ...

I sometimes get frustrated when developing machine learning algorithms in C++.  And since working in object recognition basically means you have to be a machine learning expert, trying something new and exciting in C++ can be extremely painful.  I don't miss the C++-heavy workflow for vision projects at Google.  C++ is great for building large-scale systems, but not for pioneering object recognition representations.  I like to play with pixels and I like to think of everything as matrices.  But programming languages, software engineering philosophies, and other coding issues aren't going to be today's topic.  Today I want to talk about the one thing that is more valuable than computers, and that is people.  Not just people, but a community of people -- in particular, the culture at Google, and specifically vision@Google.


I miss being around the hacker culture at Google.  

The people at Google aren't just hackers, they are Jedis when it comes to building great stuff -- and that is why I recommend a Google internship to many of my fellow CMU vision Robograds (fyi, Robograds are CMU Robotics Graduate Students).  CMU-ers, like Googlers, like to build stuff.  However, CMU-ers are typically younger.


What is a software engineering Jedi, you might ask?  'Tis one who is not afraid of a million cores, one who is not afraid of building something great.  While little boys get hurt by the guns 'n knives of C++, Jedi use their tools like ninjas use their swords.  You go into Google as a boy, you come out a man.  NOTE: I do not recommend going to Google and just toying around in Matlab for 3 months.  Build something great, find a Yoda-esque mentor, or at least strive to be a Jedi.  There's plenty of time in grad school for Matlab and writing papers.  If you get a chance to go to Google, take the opportunity to go large-scale and learn to MapReduce like the pros.

Every day I learn about more and more people I respect in vision and learning going to Google, or at least interning there (e.g., Andrej Karpathy, who is starting his PhD@Stanford, and Santosh Divvala, who is a well-known CMU PhD student and vision hacker).  And I really can't blame them for choosing Google over places like Microsoft for the summer.  I can't think of many better places to be -- the culture is inimitable.  I spent two summers in Jay Yagnik's group, and some of the great people I interned with are already full-time Googlers (e.g., Luca Bertelli and Mehmet Emre Sargin).  And what is really great about vision@google is that these guys get to publish surprisingly often!  Not just throw-away-code kind of publishing, but stuff that fits inside large-scale systems -- stuff which is already inside Google products.  The technology is often inside the Google product before the paper goes public!  Of course it's not easy to publish at a place like Google because there is just way too much exciting large-scale stuff going on.  Here is a short list of some cool 2010/2011 vision papers (from vision conferences) with significant Googler contributions.



Kernelized Structural SVM Learning

“Kernelized Structural SVM Learning for Supervised Object Segmentation”, Luca Bertelli, Tianli Yu, Diem Vu, Burak Gokturk. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[abstract] [pdf]
[abstract] [pdf]


Finding Meaning on YouTube


“Finding Meaning on YouTube: Tag Recommendation and Category Discovery”, George Toderici, Hrishikesh Aradhye, Marius Pasca, Luciano Sbaiz, Jay Yagnik. Computer Vision and Pattern Recognition, 2010.
[abstract] [pdf]




Here is a very exciting and new paper from SIGGRAPH 2011.  It is a sort of Visual Memex for faces -- congratulations on this paper, guys!  Check out the video below.

Exploring Photobios Movie







Exploring Photobios from Ira Kemelmacher on Vimeo




Ira Kemelmacher-Shlizerman, Eli Shechtman, Rahul Garg, Steven M. Seitz. "Exploring Photobios." ACM Transactions on Graphics (SIGGRAPH), Aug 2011. [pdf]



Finally, here is a very mathematical paper with a sexy title from the vision@google team.  It will be presented at the upcoming ICCV 2011 Conference in Barcelona -- the same conference where I'll be presenting my Exemplar-SVM paper.  


The Power of Comparative Reasoning
Jay Yagnik, Dennis Strelow, David Ross, Ruei-Sung Lin. ICCV 2011. [PDF]


P.S. If you're a fellow vision blogger, then come find me in Barcelona@iccv2011 -- we'll go grab a beer.

Tuesday, August 16, 2011

Question: What makes an object recognition system great?

Today, instead of discussing my own perspectives on object recognition or sharing some useful links, I would like to ask a general question geared towards anybody working in the field of computer vision:

What makes an object recognition system great?

In particular, I would like to hear a broad range of perspectives regarding what is necessary to provide an impact-creating, open-source object recognition system for the research community to use.  As a graduate student you might be interested in building your own recognition system, as a researcher you might be interested in extending or comparing against a current system, and as an educator you might want to direct your students to a fully-functional object recognition system which could be used to bootstrap their research.



To start the discussion I would like to first enumerate a few elements which I find important in making an object recognition system great.

Open Source
In order for object recognition to progress, I think releasing binary executables is simply not enough.  Allowing others to see your source code means that you gain more scientific credibility and you let others extend your system -- this means letting others both train and test variants of your system.  More people using an object recognition system also translates to a higher citation count, which is favorable for researchers seeking career advancement.  Felzenszwalb et al. have released multiple open-source versions of their Discriminatively Trained Deformable Part Model -- each time we see a new release, it gets better!  Such continual development means that we know the authors really care about this problem.  I feel Github, with its distributed version control and social-coding features, is a powerful tool the community should adopt -- something which I believe is very much needed to take the community's ideas to the next level.  In my own research (e.g., the Ensemble of Exemplar-SVMs approach), I have started using Github (for both private and public development) and I love it.  Linux might have been started by a single individual, but it took a community to make it great.  Just look at where Linux is now.

Ease of use
For ease of use, it is important that the system is implemented in a popular language which is known by a large fraction of the vision community.  Matlab, Python, C++, and Java are such popular languages, and many good implementations are a combination of Matlab with some highly-optimized routines in C++.  Good documentation is also important, since one cannot expect only experts to be using such a system.

Strong research results
The YaRS approach, which is the "yet-another-recognition-system" approach, doesn't translate to high usage unless the system actually performs well on a well-accepted object recognition task.  Every year at vision conferences, many new recognition frameworks are introduced, but really only a few of them ever pass the test of time.  Usually an idea withstands time because it is a conceptual contribution to science, but systems such as the HOG-based pedestrian detector of Dalal-Triggs and the Latent Deformable Part Model of Felzenszwalb et al. are actually being used by many other researchers.  The ideas in these works are not only good, but the recognition systems are great.

Question:
So what would you like to see in the next generation of object recognition systems?  I will try my best to reply to any comments posted below.  Any really great comment might even trigger a significant discussion; enough to warrant its own blog post.  Anybody is welcome to comment/argue/speculate below, either using their real name or anonymously.