Disclaimer: I found this text file on my computer and I am deciding to post it as is. I probably never finished writing it, and meant to edit things later. But we all know how these things go (meaning that it would have never been finished).
So here it is:
First, let us discuss the notion of an object detector, then think about how the problem of image understanding is generally posed and finally look at a simple gedanken experiment.
I. The life of an object detector
An object detector's role is to localize an object in an image. Generally a large training set of manually labelled images is obtained. These images contain one or more instances of the object of interest, and the location of each instance is also known. The training set is usually large enough to capture views of the object of interest under different orientations, scales, and lighting conditions. The training set generally has to be much larger if we want to detect a member of an object class, such as a car, versus a particular object, such as my car. Object detection refers to finding a member of a class, while object recognition refers to finding that particular instance.
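The train-then-localize loop described above can be sketched in a few lines. This is a deliberately toy illustration, not any standard pipeline, and every name in it is my own invention: the 'features' are raw pixel intensities and the 'classifier' is a nearest-centroid rule. A real detector would use learned features and a proper classifier, but the overall structure is the same: crop positive patches at the labelled locations, sample negatives elsewhere, then score every window of a new image at test time.

```python
# Toy sliding-window object detector (illustrative sketch only).
# Training: crop positive patches at the labelled boxes, sample negative
# patches elsewhere, and summarize each class by its mean patch (centroid).
# Detection: score every window and return the best-scoring location.

def crop(img, r, c, h, w):
    return [row[c:c + w] for row in img[r:r + h]]

def flatten(patch):
    return [p for row in patch for p in row]

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(images, boxes, patch_h, patch_w):
    """images: 2-D grids of intensities; boxes: (row, col) of each labelled object."""
    pos = [flatten(crop(img, r, c, patch_h, patch_w))
           for img, (r, c) in zip(images, boxes)]
    neg = []
    for img, (r, c) in zip(images, boxes):
        # Negatives: non-object windows sampled on a coarse grid.
        for rr in range(0, len(img) - patch_h + 1, patch_h):
            for cc in range(0, len(img[0]) - patch_w + 1, patch_w):
                if (rr, cc) != (r, c):
                    neg.append(flatten(crop(img, rr, cc, patch_h, patch_w)))
    return centroid(pos), centroid(neg)

def detect(img, model, patch_h, patch_w):
    """Slide a window over img; return the location closest to the positive class."""
    pos_c, neg_c = model
    best, best_score = None, float("-inf")
    for r in range(len(img) - patch_h + 1):
        for c in range(len(img[0]) - patch_w + 1):
            v = flatten(crop(img, r, c, patch_h, patch_w))
            # Higher score = nearer the positive centroid, farther from the negative one.
            score = sqdist(v, neg_c) - sqdist(v, pos_c)
            if score > best_score:
                best, best_score = (r, c), score
    return best
```

The point of the sketch is the shape of the computation, not the features: swapping the pixel centroids for learned features and a trained classifier gives you the detectors discussed in this section.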
Why object detection? Generally one is interested in a particular vision task, such as creating an autonomous vehicle that can drive on highways. In this case, the vision researcher can reason about the types of objects that are generally seen in the particular application and train an object detection module for each object type. A car detector, a road sign detector, a tree detector, a bridge detector, a road lane detector, a person detector, a grass detector, a cloud detector, a gas station detector, a police car detector, and a sun detector could be used in combination to create a pretty decent scene understanding system. This system would look at an image and segment it by assigning each pixel to one of those classes or to the 'unclassified' category. This 'unclassified' category is also known as the background category, or the clutter category; it represents the 'uninteresting' stuff.
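To make the combination concrete, here is a minimal sketch (all names and heuristics are hypothetical) of how independent per-class scorers could be fused into the per-pixel labelling described above: each pixel is assigned to its highest-scoring class, or to 'background' when no detector is confident enough.

```python
# Hypothetical sketch: fusing per-class scoring functions into a per-pixel
# labelling. Each "detector" maps a pixel value to a confidence score;
# a pixel gets the best class if its score clears the threshold, and the
# 'background' (clutter / unclassified) label otherwise.

def segment(image, detectors, threshold=0.5):
    labels = []
    for row in image:
        out = []
        for pixel in row:
            best_class, best_score = "background", threshold
            for name, score_fn in detectors.items():
                s = score_fn(pixel)
                if s > best_score:
                    best_class, best_score = name, s
            out.append(best_class)
        labels.append(out)
    return labels
```

Note that the background category falls out of the threshold: it is not detected, it is simply what remains when no detector fires, which is exactly the 'uninteresting stuff' role described above.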
II. What comes after seeing
Awesome! I just concocted a recipe for creating an autonomous vehicle!
Unfortunately, there are several problems with this approach. First of all, segmenting an image into the categories spelled out above falls short of having the car know what it should do to navigate this visual world. I didn't talk about how the segmentation of an image captured by a camera on top of a vehicle relates to navigation. Apparently, this recipe is only good for answering queries of the type, 'Where is object O in image I?' Unfortunately, the only question that is really interesting is, 'What do I do once I see image I?'
III. Semantic Segmentation
The problem of computer vision is traditionally posed as something like this: given an image I, segment it into semantic categories and give me the 3D position and orientation of the objects found in the image with respect to the camera center. We might also want information about the lighting in the scene, so that we can recover the true appearance of the objects we found in the image. I want to call this mapping of an image into a set of locations/orientations of objects, object labels, and lighting conditions a semantic segmentation.
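Written down as a data structure, the semantic segmentation described above might look like the following sketch. The field names are my own and nothing here is a standard representation; it simply collects the three ingredients named in this section: per-pixel labels, located and oriented objects, and lighting.

```python
# Sketch of the "semantic segmentation" output described above, as a
# data structure. All names are illustrative assumptions, not a standard API.
from dataclasses import dataclass, field

@dataclass
class DetectedObject:
    label: str          # semantic category, e.g. "car"
    position: tuple     # 3D position (x, y, z) w.r.t. the camera center
    orientation: tuple  # e.g. (roll, pitch, yaw) in radians

@dataclass
class SemanticSegmentation:
    pixel_labels: list  # per-pixel category names (or 'background')
    objects: list = field(default_factory=list)   # DetectedObject instances
    lighting: dict = field(default_factory=dict)  # e.g. {"sun_direction": (x, y, z)}
```

Under this view, the hypothetical "real vision system" of the next paragraph is just a learned function from a `SemanticSegmentation` to an action.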
It seems that if we could obtain this semantic segmentation, we could then learn a mapping from a semantic segmentation to an action in order to have a real vision system.
--Omnipotency problem of vision
The problem with this approach is what I will refer to as the omnipotency problem of vision. This problem is that a vision system is required to know everything about the visual world in order to know what action to take. I honestly don't believe that we need all this information about the world to know what to do. A vision system should only care about extracting the minimal amount of information from an image needed to know what to do next, within some small error threshold.
-- Scaling problems with unbounded growth of object categories
Another problem with the semantic segmentation approach is that it doesn't scale when you start looking at vision systems that can perform a large number of vision tasks. The number of object categories is extremely large!
-- Ill-defined object categories
The big problem that I want to talk about is the problem of defining the objects that our system would detect. Do we have separate objects for 'baby' and 'old man', or do we treat them as just large geometric deformations of the concept 'human'? When you take one tire off of a car, it is still a car; but as you take more and more pieces off, when does it cease to be a 'car'? Should a tree be considered one object, or should we treat it as an assembly of {leaves, branches, trunk}? Clearly the notion of 'object' is ill-defined. I think the biggest problem with contemporary vision is that not enough people really see how grand a problem it is. Computer vision isn't only concerned with hacking out a driving system; the deep questions that arise are some of the deepest philosophical questions that have been around since the start of man's inquiry.
--Naive desires
Will we ever solve the problem of computer vision? When somebody thinks of this problem in a naive 'hack-out-a-system' kind of way, then one would also think 'why not?' However, when one sees beyond the systems, beyond the geometry, and beyond the statistical modeling then one can see that the problem of computer vision isn't really about computers at all! How do we (humans) live so effortlessly in this complex world around us? This question has many nuances, and every generation of great thinkers has asked a slightly different question. Of course this makes perfect sense, since each generation has been thinking within the paradigm of their time and it is probably not a good idea to even think of this as a variation of the same question when we consider the incommensurability of ideas across paradigm shifts.
--The big problem: The bold answer
Will we ever solve the problem of computer vision? Of course not. If you (the reader) still think that you can solve this problem, then you need to get out more.