Thursday, March 26, 2009

Beyond Categorization: Getting Away From Object Categories in Computer Vision

Natural language evolved over thousands of years to become the powerful tool that is is today. When we say things using language to convey our experiences with the world, we can't help but refer to object categories. When we say things such as "this is a car" what we are actually saying is "this is an instance from the car category." Categories let us get away from referring to individual object instances -- in most cases knowing that something belongs to a particular category is more than enough knowledge to deal with it. This is a type of "understanding by compression" or understanding by abstracting away the unnecessary details. In the words of Rosch, "the task of category systems is to provide maximum information with the least cognitive effort." Rosch would probably agree that it only makes sense to talk about the utility of a category system (a for getting a grip on reality) as opposed to the truth value of a category system with respect how well it aligns to observer-independent reality. The degree of pragmatism expressed by Rosch is something that William James would have been proud of.

From a very young age we are taught language and soon it takes over our inner world. We 'think' in language. Language provides us with a list of nouns -- a way of cutting up the world into categories. Different cultures have different languages that cut up the world differently and one might wonder how well the object categories contained in any given single language correspond to reality -- if it even makes sense to talk about an observer independent reality. Rosch would argue that human categorization is the result of "psychological principles of categorization" and is more related to how we interact with the world than how the world is. If the only substances we ingested for nutrients were types of grass, then categorizing all of the different strains of grass with respect to flavor, vitamin content, color, etc would be beneficial for us (as a species). Rosch points out in her works that her ideas refer to categorization at the species-level and she calls it human categorization. She is not referring to a personal categorization; for example, the way a child might cluster concepts when he/she starts learning about the world.

It is not at all clear to me whether we should be using the categories from natural language as the to-be-recognized entities in our image understanding systems. Many animals do not have a language with which they can compress percepts into neat little tokens -- yet they have no problem interacting with the world. Of course, if we want to build machines that understand the world around them in a way that they can communicate with us (humans), then language and its inherent categorization will play a crucial role.

While we ultimately use language to convey our ideas to other humans, how early are the principles of categorization applied to perception? Is the grouping of percepts into categories even essential for perception? I doubt that anybody would argue that language and its inherent categorization is not useful for dealing with the world -- the only question is how it interacts with perception.

Most computer vision researchers are stuck in the world of categorization and many systems rely on categorization at a very early stage. A problem with categorization is its inability to deal with novel categories -- something which humans must deal with at a very young age. We (humans) can often deal with arbitrary input and using analogies can still get a grip and the world around us (even when it is full of novel categories). One hypothesis is that at the level of visual perception things do not get recognized into discrete object classes -- but a continuous recognition space. Thus instead of asking the question, "What is this?" we focus on similarity measurements and ask "What is this like?". Such a comparison-based view would help us cope with novel concepts.


  1. I think that's a good idea.
    The way i see it, both classes and observations (features) should have some representation. Then one would have distance functions not only between observations and classes, but also between classes and classes, and between observations and observations. One could cluster classes for example (all kinds of 4-legged animals would be close, such as dogs and cows). I think there's a lot of stuff to be done there.

  2. I think the preoccupation with categories stems not from scientific pursuit of objective reality but from practical evaluation metrics ("how well does this algorithm find horses in meadows?"). As the evaluation metrics become less category-oriented, so too should the algorithms.

    If you are interested in continuous semantic spaces, you should talk to Mark Palatucci and Tom Mitchell.

  3. Good observation. Maybe the visual system make finer and finer divisions than categories, which are more similar, and compound them together. Just like the Biological Inspired Model by Poggio. It just remembers all possible variations of an instance and combine them into higher levels.

  4. Mark is a friend of mine and we have had many interesting discussions on categorization and semantic spaces.

    Thanks for the comments guys!

  5. Some of the ideas in this post can be found in my NIPS 2009 paper:

    Tomasz Malisiewicz, Alexei A. Efros. Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships. In NIPS, December 2009.