Wednesday, June 06, 2007

a little bit more than computer vision?

Most object detection approaches in computer vision rely on computing features and feeding them to a powerful classifier, which predicts the presence of an object from some predefined object class such as bike, car, pedestrian, or cat.
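To make the "features plus classifier" recipe concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the feature (an intensity histogram), the function names, and the hand-picked weights do not come from any real detector or from the PASCAL challenge. A real system would learn the weights from labeled training images and use a much richer descriptor.

```python
# Toy illustration of the "features + classifier" recipe.
# All names and numbers here are hypothetical, chosen for illustration only.

def intensity_histogram(patch, bins=4):
    """Feature: normalized histogram of pixel intensities in [0, 255]."""
    counts = [0] * bins
    total = 0
    for row in patch:
        for value in row:
            # Map an intensity to its bin; clamp so 255 falls in the last bin.
            index = min(value * bins // 256, bins - 1)
            counts[index] += 1
            total += 1
    return [count / total for count in counts]


def classify(features, weights, bias):
    """Linear classifier: declare the object present when the score is positive."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return score > 0, score


# Made-up weights that favor dark patches; a real detector would learn these.
WEIGHTS = [1.0, 0.0, 0.0, -1.0]
BIAS = 0.0

dark_patch = [[10, 20], [30, 40]]
bright_patch = [[250, 240], [230, 220]]

for name, patch in [("dark", dark_patch), ("bright", bright_patch)]:
    present, score = classify(intensity_histogram(patch), WEIGHTS, BIAS)
    print(name, present, score)
```

Note that the classifier only ever sees the appearance of the patch itself. Nothing about the surrounding scene enters the decision.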

The 2007 PASCAL Visual Object Classes Challenge has 10 more object categories than the previous year. Three object categories that I would like to discuss are: 1.) chair, 2.) sofa, and 3.) tv/monitor.

Here are some images containing instances from those three object categories:



A problem arises when one considers the difference between a sofa and a chair. What is the visual difference between a large chair and a small sofa? Is it even worthwhile to engineer features to discriminate between these two object categories? Functionally and contextually, both are very similar object categories (things to sit on). The biggest differences between chairs and sofas are their size and the rooms they are located in. But is this something that an algorithm can learn by training only on instances of chairs and sofas?

Another interesting category in the PASCAL challenge is tv/monitor. For some reason a laptop screen is not considered an example of this category, and if an algorithm labeled the MacBook monitor above as an instance of tv/monitor, it would have been deemed incorrect by the challenge! Perhaps the only way of determining that a screen is not part of a laptop is by looking for the presence of a connected keyboard. But what if a disconnected keyboard is spatially close to a computer monitor? Are we expected to make such subtle distinctions?

What I'm trying to get at here is that some of the object classes in the PASCAL 2007 challenge are ridiculous! While I do think that it is possible to detect these things in theory, the subtleties that are addressed in this challenge are simply beyond the realm of computer vision. If you consider the reasoning about object function, 3D layout, and context that is necessary to detect some of these "objects," then I do not expect any modern algorithm to do well on these object categories in 2007. Personally, I believe that the visual appearance of a "sofa" or "monitor" is not really very important in determining that it is that object. Once we realize that object recognition requires a little bit more than a powerful descriptor of visual appearance, should we still treat object recognition as a part of computer vision? Computer metaphysics?