Tuesday, January 26, 2010

Beyond Categories =? Doing without Concepts

The term "beyond category," from my limited knowledge, was originally coined to describe the music of Duke Ellington. It is a term of praise that acknowledges that one's style is inimitable and transcends barriers.

"Beyond Categories" was the first part of my NIPS 2009 paper's title. To "go beyond" means to transcend, to abandon or do without some limitation and strive higher -- there is nothing magical about my use of the term. I used the term category to refer to object categories, as commonly used in computer vision, artificial intelligence, and machine learning, as well as in psychology, philosophy, and other branches of cognitive science. One of my research goals is to go beyond the use of categories as the basis for machine perception and visual reasoning. Machery has argued that the term category is roughly equivalent to the term concept as used in the psychology literature. In some sense, the title of Machery's recent book, "Doing without Concepts," is analogous to the phrase "Beyond Categories," but to be sure I'll have to finish reading the book.

So far the first chapter has been a delightful introduction to the world of concepts, a term dear to researchers in machine perception (AI) as well as human categorization (psychology). I look forward to reading the rest of the book, which I accidentally found while looking for Estes' book on categorization. I had already digested some of Machery's work, in particular his paper titled "Concepts are not a natural kind," so seeing his name on a book at the CMU library piqued my interest. In this 2005 paper, Machery argues that the debate between prototypes vs. exemplars vs. theories in the literature on concepts is not well-founded and that there is no reason to believe a single theory should prevail. I'll attempt to summarize some of his take-home messages and their relevance to computer vision once I finish this book.

Wednesday, January 20, 2010

Heterarchies and Control Structure in Image Interpretation

Several days ago I was reading one of Takeo Kanade's classic computer vision papers from 1977, titled "Model Representation and Control Structure in Image Understanding," and I came across a new term, heterarchy. I think motivating this concept is as important as defining it. At the representational level, Kanade does a good job of advocating the use of multiple levels of representation -- from pixels to patches to regions to subimages to objects. In addition to discussing the representational aspects of image understanding systems, Kanade analyzes different strategies for using knowledge in such systems (he uses the term control structure to signify the overall flow of information between subroutines). At one extreme is pass-oriented processing (this is Kanade's term -- I prefer to use the terms feed-forward or bottom-up), which relies on iteratively building higher levels of interpretation from lower ones. Marr's vision pipeline is mostly bottom-up, but that discussion will be left for another post. At the other extreme is top-down processing, where the image is analyzed in a global-to-local fashion. Of course, as of 2010 these ideas are being used on a regular basis in vision. One example is the paper "Learning to Combine Bottom-Up and Top-Down Segmentation" by Levin and Weiss.

Kanade acknowledges that the flow of a vision algorithm is very much dependent on the representation used. For image understanding, bottom-up and top-down processing will both be critical components of the entire system. However, the exact strategy for combining these processes, in addition to countless other mid-level stages, is not very clear. Directly quoting Kanade, "The ultimate style would be a heterarchy, in which a number of modules work together like a community of experts with no strict central executive control." According to this line of thought, processing would occur in a loopy and cooperative style. Kanade attributes the concept of a heterarchy to Patrick Winston, who worked with robots in the golden days of AI at MIT. Like Kanade, Winston criticizes a linear flow of information in scene interpretation (this criticism dates back to 1971). The basic problem outlined by both Kanade and Winston is that modules such as line-finders and region-finders (think segmentation) are simply not good enough to be used in subsequent stages of understanding. In my own research I have used the concept of multiple image segmentations to bypass some of the issues with relying on the output of low/mid-level processing for high-level processing. In 1971 Winston envisioned an algorithmic framework that is a melange of subroutines -- a web of algorithms created by different research groups -- that would interact and cooperate to understand an image. This is analogous to the development of an operating system like Linux. There is no overall theory developed by a single research group that made Linux a success -- it is the body of hackers and engineers who produced a wide range of software that makes using Linux a success.
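To make the contrast with a linear pipeline concrete, here is a toy sketch of heterarchy-style control. It is my own illustration, not code from Kanade or Winston: the module names, the shared state dictionary, and the confidence numbers are all made up. Each module reads whatever the others have posted, revises its own hypothesis, and the loop simply runs until nobody has anything new to say.

```python
from typing import Callable, Dict, List

State = Dict[str, float]          # shared hypotheses: name -> confidence (hypothetical)
Module = Callable[[State], bool]  # a module returns True if it changed the state


def post(state: State, key: str, value: float) -> bool:
    """Write a hypothesis to the shared state; report whether anything changed."""
    if state.get(key) == value:
        return False
    state[key] = value
    return True


def line_finder(state: State) -> bool:
    # Weak evidence from pixels alone; more confident once regions give context.
    return post(state, "lines", 0.9 if "regions" in state else 0.5)


def region_finder(state: State) -> bool:
    # Needs lines to get started; top-down object hypotheses sharpen it.
    if "lines" not in state:
        return False
    return post(state, "regions", 0.8 if "objects" in state else 0.6)


def object_recognizer(state: State) -> bool:
    # Only fires once some region hypothesis is available.
    if state.get("regions", 0.0) < 0.6:
        return False
    return post(state, "objects", 0.7)


def run_heterarchy(modules: List[Module], state: State, max_rounds: int = 10) -> int:
    """Let every module run each round; stop when a full round changes nothing."""
    for round_number in range(1, max_rounds + 1):
        changed = [module(state) for module in modules]  # no module is "in charge"
        if not any(changed):
            return round_number
    return max_rounds


state: State = {}
rounds = run_heterarchy([object_recognizer, region_finder, line_finder], state)
print(rounds, state)
```

Note that the modules are listed "backwards" (objects first) and the loop still settles on the same hypotheses as the forward ordering would; the order only affects how many rounds it takes. A real system would of course need far richer hypotheses than a single confidence per level.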

Unfortunately, given the tradition of computer vision research, I believe that an open-source-style group effort in this direction will not come out of university-style research (which is overly coupled with the publishing cycle). It would be a noble effort, but would be more of a feat of engineering than of science. Imagine a group of 2-3 people creating an operating system from scratch -- it seems like a crazy idea in 2010. However, computer vision research is often done in such small teams (actually there is often a single hacker behind a vision project). But maybe going open-source and allowing several decades of interaction will actually produce usable image understanding systems. I would like to one day lead such an effort -- being both the theoretical mastermind and the hacker behind this vision. I am an INTJ, hear me roar.

Monday, January 18, 2010

Understanding versus Interpretation -- a philosophical distinction

Today I want to bring up an interesting discussion regarding the connotation of the word "understanding" versus "interpretation," particularly in the context of "scene understanding" versus "scene interpretation." While many vision researchers use these terms interchangeably, I think it is worthwhile to make the distinction, albeit a philosophical one.

On Understanding
While everybody knows that the goal of computer vision is to recognize all of the objects in an image, there is plenty of disagreement about how to represent objects and recognize them in the image. There is a physicalist account (from Wikipedia: Physicalism is a philosophical position holding that everything which exists is no more extensive than its physical properties), where the goal of vision is to reconstruct veridical properties of the world. This view is consistent with the realist stance in philosophy (think back to Philosophy 101) -- there exists a single observer-independent 'ground-truth' regarding the identities of all of the objects contained in the world. The notion of vision as measurement is very strong under this physicalist account. The stuff of the world is out there just waiting to be grasped! I think the term "understanding" fits very well into this truth-driven account of computer vision.

On interpretation
The second view, a postmodern and anti-realist one, is of vision as a way of interpreting scenes. The shift is from veridical recovery of the properties of the world from an image (measurement) to the observer-dependent interpretation of the input stimulus. Under this account, there is no need to believe in a god's eye 'objective' view of the world. Image interpretation is the registration of an input image with a vast network of past experience, both visual and abstract. The same person can vary their own interpretation of an input as time passes and their internal knowledge base evolves. Under this view, two distinct robots could provide very useful yet distinct 'image interpretations' of the same input image. The main idea is that different robots could have different interpretation-spaces; that is, they could obtain incommensurable (yet very useful!) interpretations of the same image.

Donald Hoffman (Interface Theory of Perception) has argued that there is no reason why we should expect evolution to have driven humans towards veridical perception. In fact, Hoffman argues that nature drives veridical perception towards extinction and that it only makes sense to speak of perception as guiding agents towards pragmatic interpretations of their environment.

In philosophy of science, there is the debate of whether the field of physics is unraveling some ultimate truth about the world versus physics painting a coherent and pragmatic picture of the world. I've always viewed science as an art and I embrace my anti-realist stance -- which has been shaped by Thomas Kuhn, William James, and many others. While my scientific interests have currently congealed in computer vision, it is no surprise that I'm finding conceptual agreement between my philosophy of science and my concrete research efforts in object recognition.

Thursday, January 14, 2010

Tuesday, January 12, 2010

Image Interpretation Objectives

An example of a typical complex outdoor natural scene that a general knowledge-based image interpretation system might be expected to understand is shown in Figure 1. An objective of such systems is to identify semantically meaningful visual entities in a digitized and segmented image of some scene. That is, to correctly assign semantically meaningful labels (e.g., house, tree, grass, and so on) to regions in an image -- see [29,30]. A computer-based image interpretation system can be viewed as having two major components, a "low-level" component and a "high-level" component [19],[31]. In many respects, the low-level portion of the system is designed to mimic the early stages of visual image processing in human-like systems. In these early stages, it is believed that scenes are partitioned, to some extent, into regions that are homogeneous with respect to some set of perceivable features (i.e., feature vector) in the scene [6],[40],[39]. To this extent, most low-level general purpose computer vision systems are designed to perform the same task. An example of a partitioning (i.e., segmentation) of Figure 1 into homogeneous regions is shown in Figure 2. The knowledge-based computer vision system we shall describe in this paper is not currently concerned with resegmenting portions of an image. Rather, its task is to correctly label as many regions as possible in a given segmentation.

This is a direct quote from a 1984 paper on computer vision, and a great example of segmentation-driven scene understanding. The content is similar enough to my own line of work that it could have been an excerpt from my own thesis.

It is actually from a section called Image Interpretation Objectives in "Evidential Knowledge-Based Computer Vision" by Leonard P. Wesley, 1984. I found this while reading lots of good tech reports from SRI International's AI Center in Menlo Park. There is some good stuff there by Tenenbaum, Barrow, Duda, Hart, Nilsson, Fischler, Pereira, Pentland, Fua, and Szeliski, to name a few. Much of it is relevant to scene understanding and grounds the problem in robotics (since there was no "internet" vision back in the 70s and 80s).
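The objective in the quoted passage (take a fixed segmentation and label as many regions as possible) is easy to sketch in a few lines. To be clear, this is not Wesley's method: his system uses evidential reasoning, and I am swapping in a plain nearest-prototype rule just to show the shape of the task. The two-dimensional features, the prototype values, and the `max_distance` threshold are all invented for illustration.

```python
import math
from typing import Dict, Optional, Tuple

Features = Tuple[float, float]  # made-up feature vector: (greenness, brightness)

# Hypothetical class prototypes; a real system would learn or hand-build these.
PROTOTYPES: Dict[str, Features] = {
    "grass": (0.9, 0.5),
    "tree": (0.7, 0.3),
    "house": (0.2, 0.7),
}


def label_regions(
    regions: Dict[int, Features],
    prototypes: Dict[str, Features],
    max_distance: float = 0.25,
) -> Dict[int, Optional[str]]:
    """Give each region the label of its nearest prototype.

    The segmentation is taken as given (no resegmentation). Regions that are
    too far from every prototype stay unlabeled (None) rather than being
    forced into a category.
    """
    labels: Dict[int, Optional[str]] = {}
    for region_id, features in regions.items():
        best_label, best_distance = min(
            ((name, math.dist(features, proto)) for name, proto in prototypes.items()),
            key=lambda pair: pair[1],
        )
        labels[region_id] = best_label if best_distance <= max_distance else None
    return labels


# Region id -> feature vector, as if computed from a segmented image.
regions = {1: (0.88, 0.52), 2: (0.25, 0.65), 3: (0.5, 0.95)}
print(label_regions(regions, PROTOTYPES))
# {1: 'grass', 2: 'house', 3: None}
```

Region 3 is left unlabeled because it is not close to any prototype, which is one simple reading of "label as many regions as possible" rather than all of them.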

On another note, I still haven't been able to find a copy of the classic paper "Experiments in Interpretation-Guided Segmentation" by Tenenbaum and Barrow from 1978. If anybody knows where to find a PDF copy, send me an email. UPDATE: Thanks for the quick reply! I have the paper now.