Thursday, March 24, 2011

Computer Vision is Artificial Intelligence

Computer vision is a diverse field and its researchers have multifaceted interests and aspirations.  It should not be surprising that no two vision researchers think about the field in the same way.  Different academic backgrounds foster alternative and potentially incommensurable interpretations.  It is as if W.V.O. Quine's thesis that no observation can be "theory-independent" directly applies to vision: researchers in computer vision cannot uphold a view of their own field that is objective and independent of their own predispositions, upbringing, and educational program.  While I cannot speak clearly about the long-term goals of the entire body of researchers in vision, today I would like to discuss my own take on computer vision.  I do not offer the world an objective account of why computer vision intrigues me, but by sharing with the world the reasons why I find vision exciting, perhaps together we can break the boundaries of machine intelligence.

Cognitive Science is a computational study of the mind: McGill Cognitive Science

One of the biggest accomplishments in the field of Artificial Intelligence came when Deep Blue, a chess-playing program developed at IBM, beat the world chess champion, Garry Kasparov.  But this was in the early days of artificial intelligence -- when computer scientists still weren't sure what it means for a machine to be intelligent.  Chess is a well-known thinking-man's game, and at first glance it seems that a machine can only be worthy of being dubbed intelligent if it performs competitively on intelligent-people activities such as chess.

Chess: Human vs. Machine: Slate article about Deep Blue

Given the plethora of tasks that humans can effortlessly perform in daily life, is engineering a machine to rival humans on just one such task bringing researchers any closer to building truly intelligent machines?

The problem with chess is that it has a "finite universe problem" -- there is a finite number of primitives (the chess pieces) which can be manipulated by choosing a move from a finite set of allowable actions.  But if we think of normal life (going to work, eating dinner, talking to a friend) as a game, then it is not hard to see that most everyday situations involve a seemingly infinite sea of objects (just look around and name all the different objects you can see around you!) and an equally capacious space of allowable actions (consider all the things you could do with all those objects around you!).  Intelligence is what allows us to cope with the complexities of the universe by focusing our attention on a limited set of relevant variables -- but the working set of objects/concepts we must consider at any single instant is chosen from a seemingly infinite set of alternatives.

I believe that everyday human-level visual intelligence is greatly undervalued by people -- and there is a very good reason for this!  The ability to make sense of what is going on in a single picture is such a trivial and automatic task for humans that we don't even bother quantifying just how good we are at it.  But let me reassure you that automated image understanding is no trivial feat.  The world is not composed of 20 visual object categories, and the space of allowable and interpretable utterances we could associate with a static picture is seemingly infinite.  While the 20-category object detection task (as popularized by the PASCAL VOC challenge) does have a finite universe problem, the grander version of the vision master problem (a combination of detection/recognition/categorization where you can interpret an input any way you like) is much more complex and mirrors the structure of the external world well.

Robotics Challenge: Build a Robot like Bender

Any application which calls for automated analysis of images requires vision.  A robot, if it is to succeed at interacting with the world and performing useful tasks, needs to perceive the external world and organize it.  While some see vision as just one small piece of the "Robotics Challenge" (build a robot and make it do cool stuff), it is totally unclear to me where to draw the boundary between low-level pixel analysis and high-level cognitive scene understanding.  Over the years, I have been thinking more and more about this problem, and I've convinced myself that the interesting part of vision is precisely at the boundary between what is commonly thought of as low-level representation of the signal and what is considered high-level representation of visual concepts.  While some view computer vision as "applied mathematics" or "applied machine learning" or "image processing in disguise", I passionately believe the following:

Computer Vision is Artificial Intelligence

I am not promulgating the thesis that all aspects of machine intelligence are visual, but I want to assure you that there are enough high-level semantic capabilities which must be set in place for vision to work that it is not worthwhile to think of vision as a smaller problem than general-purpose intelligence.  I believe that once we have made progress on vision (not in the narrow-universe setting) to the point where generic visual scene understanding is effectively solved, there won't be much left that needs to go into the "ethereal" mind which cognitive scientists want to empower machines with!  The only way to make machines truly understand scenes, objects, and their interactions is to make machines know something about the fabric of human life, and it is important for machines to learn this for themselves from real-world experience.  This goes beyond representing object appearance because folk physics, folk psychology, causality, spatio-temporal continuity, etc. are all faculties which vision systems will need (at least the vision systems I want to ultimately build) for general-purpose scene understanding.  I don't want to undermine the efforts of cognitive scientists (who work on many of the theories/ideas I've delineated above), but perhaps only to convince them that I have been a cognitive scientist all along.  I don't think placing a label on myself, by calling myself either a cognitive scientist, a computer vision researcher, or an AI researcher, is very conducive to good research.