Tuesday, November 18, 2008

Algorithmic Simplicity + Data > Algorithmic Complexity

Decades ago (during the era of Rodney Brooks, Takeo Kanade, and other such great computational thinkers) computer vision researchers were manually designing complex AI programs for image analysis. Back then, if an algorithm worked on a single real image, it was publishable. The parameters of some of these complicated models were often tuned by hand -- and that was okay -- there simply wasn't enough image data to fit these models from examples.

We are now living in a Machine Learning generation where hand-tweaked parameters are looked down upon, and if you want to publish an object recognition paper you'll need to test your algorithm on a standard dataset containing hundreds of images spanning many different types of objects. There is still a lot of excitement about Machine Learning in the air, and new approaches are constantly being introduced as the new 'state-of-the-art' on canonical datasets. The problem with this mentality is that researchers are introducing a lot of complicated machinery, and it is often unclear whether these new techniques will stand the test of time.

Peter Norvig -- now at Google -- advocates an alternative view. Rather than designing ever more advanced machinery to work with a measly 20,000 or so training images for an object recognition task, we shouldn't be too eager to draw conclusions from such paltry training sets. In a recent video lecture of his that I watched, Norvig showed some interesting results where the algorithms that obtained the best performance on a small dataset no longer did the best when the size of the training set was increased by an order of magnitude. In some cases, with the test set held fixed, the simplest algorithms provided with an order of magnitude more training data outperformed the most advanced 'state-of-the-art.' Similarly, algorithms that were mediocre in the small-training-set regime often outperformed their more complicated counterparts once more data was utilized.

The next generation of researchers will inevitably be using much more training data than we are at the moment, so if we want our scientific contributions to pass the test of time, we have to focus on designing simple yet principled algorithms. Focus on simplicity. Consider a particular recognition task, namely car recognition. Without any training data we are back in the 1960s/1970s, where we have to hard-code rules about what it means to be a car in order for an algorithm to work on a novel image. With a small amount of labeled training data, we can learn the parameters of a general parts-based car detector -- we can even learn the appearance of such parts. But what can we do with millions of images of cars? Do we even need much more than a large-scale nearest neighbor lookup?
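
To make that last question concrete, here is a minimal sketch of what such a lookup could look like, written in Python/NumPy rather than anything tied to a real system; the feature representation, variable names, random toy data, and brute-force search are all illustrative assumptions on my part.

import numpy as np

def nearest_neighbor_label(query, train_features, train_labels):
    # query          : (d,)   feature vector of the test image
    # train_features : (n, d) feature vectors of the labeled training images
    # train_labels   : (n,)   labels, e.g. 1 = car, 0 = not car
    # Squared Euclidean distance to every training example,
    # then return the label of the closest one.
    dists = np.sum((train_features - query) ** 2, axis=1)
    return train_labels[np.argmin(dists)]

# Toy usage: one million random "training images" in a 16-D feature space,
# standing in for real image descriptors.
rng = np.random.default_rng(0)
train_features = rng.standard_normal((1_000_000, 16))
train_labels = rng.integers(0, 2, size=1_000_000)
query = rng.standard_normal(16)
print(nearest_neighbor_label(query, train_features, train_labels))

The point is not that brute force scales forever, but that with enough labeled examples the "algorithm" can be little more than a distance computation.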

As Rodney Brooks once said, "The world is its own best representation," and perhaps we should follow Google's mentality and simply equip our ideas with more, more, more training data.

Tuesday, November 04, 2008

Computer Vision as immature?

On his web page, Ashutosh Saxena points out an interesting quote about Computer Vision from the Wikipedia article as of August 2007. I just checked the Wikipedia article on Computer Vision, and it seems this paragraph is still there. Parts of it go as follows:

The field of computer vision can be characterized as immature and diverse ... Consequently there is no standard formulation of "the computer vision problem." ... no standard formulation of how computer vision problems should be solved.

I agree that there is no elegant equation akin to F=ma or Schrödinger's wave equation that magically explains how meaning is to be attributed to images. While this might seem like a weakness, especially to the mathematically inclined who are always seeking to generalize and abstract away, I am skeptical of Computer Vision ever being grounded in such an all-encompassing mathematical theory.

Being a discipline centered on perception and reasoning, Computer Vision has something about it that will forever escape formalization. State-of-the-art computer vision systems that operate on images can return many different types of information. Some systems return bounding boxes of all object instances from a single category, some break up the image into regions (segmentation) and say nothing about object classes/categories, and others assign a single object-level category to the entire image without performing any localization/segmentation. Aside from objects, some systems (see Hoiem et al. and Saxena et al.) return a geometric 3D layout of the scene. While it seems that humans can do extremely well at all of these tasks, it makes sense that different robotic agents interacting with the real world should perceive the world differently to accomplish their own varying tasks. Think of biological vision -- do we see the same world as dogs? Is there an objective, observer-independent reality that we are supposed to see? To me, perception is very personal, and while my hardware (brain) might appear similar to another human's, I'm not convinced that we see/perceive/understand the world the same way.

I can imagine researchers ~40 years ago trying to come up with an abstract theory of computation that would allow one to run arbitrary computer programs. What we have today instead is a myriad of operating systems and programming languages suited to different crowds and different applications. While the humanoid robot in our living room is nowhere to be found, I believe that if we wait until that day and inspect its internal workings, we will not see a beautiful, rigorous mathematical theory. We will see AI/mechanical components developed by different research groups and integrated by other researchers -- the fruits of a long engineering effort. These bots will be always learning, always getting updates, and always being replaced by newer and better ones.

Linear Support Vector Machine (SVM) in the Primal

I often solve linear SVMs. These are convex optimization problems, much like logistic regression, where the goal is to find a linear decision boundary between two classes. Unlike logistic regression, only the points near the decision boundary affect its parameters. Data points that are correctly classified by a margin do not influence the decision boundary.

Quite often when SVMs are taught in a Machine Learning course, the dual formulation and kernelization are jumped into very quickly. While the formulation looks nice and elegant, I'm not sure how many students could come home after such a lecture and implement an SVM in a language such as MATLAB on their own.

I have to admit that I never really obtained a good grasp of Support Vector Machines until I sat through John Lafferty's lectures in 10-702: Statistical Machine Learning, which demystified them. The main idea is that an SVM is just like logistic regression but with a different loss function -- the hinge loss function.
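
To see the point concretely, here is a tiny Python/NumPy sketch (my own illustration, not anything from the course) comparing the two loss functions as a function of the margin m = y * w'x, with labels y in {-1, +1}:

import numpy as np

def logistic_loss(margin):
    # loss used by logistic regression
    return np.log(1.0 + np.exp(-margin))

def hinge_loss(margin):
    # loss used by the linear SVM
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-2, 3, 6)
print(np.column_stack([margins, logistic_loss(margins), hinge_loss(margins)]))

Points with margin >= 1 incur exactly zero hinge loss, which is why correctly classified points far from the boundary do not influence the SVM solution, whereas the logistic loss is never exactly zero.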

A very good article on SVMs, and on how they can be efficiently tackled in the primal (which is super easy to understand), is Olivier Chapelle's Training a Support Vector Machine in the Primal. Chapelle is a big proponent of primal optimization, and he has some arguments on why primal approximations are better than dual approximations. On his webpage one can also find MATLAB code for SVMs which is very fast in practice since it uses a second-order optimization technique. In order to use such a second-order technique, the squared hinge loss is used (see the article for why). I have used this code many times in the past, even for linear binary classification problems with over 6 million data points embedded in 7 dimensions.
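
For intuition only, here is a simplified NumPy sketch of primal training with the squared hinge loss. This is plain gradient descent, not Chapelle's second-order MATLAB code, and the variable names, step size, and toy data are illustrative assumptions:

import numpy as np

def train_primal_svm(X, y, lam=1e-2, lr=0.1, iters=200):
    # Minimize  lam/2 * ||w||^2 + sum_i max(0, 1 - y_i * w'x_i)^2  over w.
    # X : (n, d) data matrix, y : (n,) labels in {-1, +1}.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)
        viol = margins < 1  # only margin violations contribute to the loss
        # Gradient of the regularizer plus the squared hinge terms.
        grad = lam * w - 2.0 * (X[viol].T @ (y[viol] * (1.0 - margins[viol])))
        w -= (lr / n) * grad
    return w

# Toy usage on linearly separable 2-D data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
w = train_primal_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))

The squared hinge is what makes the objective differentiable everywhere, which is exactly why Chapelle can apply Newton-style second-order steps instead of the crude first-order updates above.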

In fact, this is the code that I have used in the past for distance function learning. So the next time you want to throw something simple and fast at a binary classification problem, check out Chapelle's code.