Several days ago I was reading one of Takeo Kanade's classic computer vision papers from 1977, titled "Model Representation and Control Structure in Image Understanding", and I came across a new term: heterarchy. I think motivating this concept is as important as defining it. At the representational level, Kanade does a good job of advocating the use of multiple levels of representation -- from pixels to patches to regions to subimages to objects. In addition to discussing the representational aspects of image understanding systems, Kanade analyzes different strategies for using knowledge in such systems (he uses the term control structure to signify the overall flow of information between subroutines). At one extreme is pass-oriented processing (this is Kanade's term -- I prefer the terms feed-forward or bottom-up), which relies on iteratively building higher levels of interpretation from lower ones. Marr's vision pipeline is mostly bottom-up, but that discussion will be left for another post. At the other extreme is top-down processing, where the image is analyzed in a global-to-local fashion. Of course, as of 2010 these ideas are being used on a regular basis in vision. One example is the paper "Learning to Combine Bottom-Up and Top-Down Segmentation" by Levin and Weiss.
Kanade acknowledges that the flow of a vision algorithm is very much dependent on the representation used. For image understanding, both bottom-up and top-down processing will be critical components of the entire system. However, the exact strategy for combining these processes, in addition to countless other mid-level stages, is not very clear. Directly quoting Kanade, "The ultimate style would be a heterarchy, in which a number of modules work together like a community of experts with no strict central executive control." According to this line of thought, processing would occur in a loopy and cooperative style. Kanade attributes the concept of a heterarchy to Patrick Winston, who worked with robots in the golden days of AI at MIT. Like Kanade, Winston criticizes a linear flow of information in scene interpretation (this criticism dates back to 1971). The basic problem outlined by both Kanade and Winston is that modules such as line-finders and region-finders (think segmentation) are simply not good enough for their output to be trusted by subsequent stages of understanding. In my own research I have used the concept of multiple image segmentations to bypass some of the issues with relying on the output of low- and mid-level processing for high-level processing. In 1971 Winston envisioned an algorithmic framework that is a melange of subroutines -- a web of algorithms created by different research groups -- that would interact and cooperate to understand an image. This is analogous to the development of an operating system like Linux. There is no overall theory developed by a single research group that made Linux a success -- it is the body of hackers and engineers who produced a wide range of software products that makes using Linux a success.
Unfortunately, given the tradition of computer vision research, I believe that an open-source-style group effort in this direction will not come out of university-style research (which is overly coupled with the publishing cycle). It would be a noble effort, but it would be more a feat of engineering than of science. Imagine a group of 2-3 people creating an operating system from scratch -- it seems like a crazy idea in 2010. Yet computer vision research is often done in such small teams (in fact, there is often a single hacker behind a vision project). But maybe going open-source and allowing several decades of interaction will actually produce usable image understanding systems. I would like to one day lead such an effort -- being both the theoretical mastermind and the hacker behind this vision. I am an INTJ, hear me roar.