Tuesday, November 24, 2009

Understanding the role of categories in object recognition

If we set aside our academic endeavors of publishing computer vision and machine learning papers and sincerely ask ourselves, "What is the purpose of recognition?" a very different story emerges.

Let me first outline the contemporary stance on recognition (that is, object recognition as embraced by the computer vision community), which is actually a bit of a "non-stance" because many people working on recognition haven't bothered to examine the motivations, implications, and philosophical foundations of their work. The standard view is that recognition is equivalent to categorization -- assigning an object its "correct" category is the goal of recognition. Object recognition, as found in vision papers, is commonly presented as a single-image recognition task, one that is not tied to an active and mobile agent that must understand and act in the environment around it. These contrived tasks are partially to blame for making us think that categories are the ultimate truth. Of course, once we've pinpointed the correct category we can look up information about the object at hand in some sort of grand encyclopedia. For example, once we've categorized an object as a bird, we can simply recall the fact that "it flies" from such a source of knowledge.

Most object recognition research is concerned with object representations (what features to compute from an image) as well as supervised (and semi-supervised) machine learning techniques for learning object models from data in order to discriminate and thus "recognize" object categories. The reason object recognition has become so popular in the past decade is that many researchers in AI/Robotics envision a successful vision system as a key component of any real-world robotic platform. If you ask a person to describe their environment, they will probably use a bunch of nouns to enumerate the stuff around them, so surely nouns must be the basic building blocks of reality! In this post I want to question the commonsense assumption that categories are the building blocks of reality and propose a different way of coping with it, one that doesn't try to directly estimate a category from visual data.

I argue that just because nouns (and the categories they refer to) are the basis of effability for humans, it doesn't mean that nouns and categories are the quarks and gluons of recognition. Language is a relatively recent phenomenon for humans (think evolutionary scale here), and it is absent in many of the animals inhabiting the earth alongside us. It is absurd to think that animals lack a faculty for recognition just because they lack language. Since animals can quite effectively cope with the world around them, there must be hope for understanding recognition in a way that doesn't invoke linguistic concepts.

Let me make my first disclaimer. I am not against categories altogether -- they have their place. The goal of language is human-human communication and intelligent robotic agents will inevitably have to map their internal modes of representation onto human language if we are to understand and deal with such artificial beings. I just want to criticize the idea that categories are found deep within our (human) neural architecture and serve as the basis for recognition.


Imagine a caveman and his daily life, which requires quite a bit of "recognition" ability to cope with the world around him. He must differentiate pernicious animals from edible ones, distinguish contentious cavefolk from his peaceful companions, and reason about the plethora of plants around him. For each object that he recognizes, he must be able to determine whether it is edible, dangerous, poisonous, tasty, heavy, warm, etc. In short, recognition amounts to predicting a set of attributes associated with an object. Recognition is the linking of perceptible attributes (it is green and the size of my fist) to our past experiences, and the prediction of attributes that are not conveyed by mere appearance. If we see a tiger, it is solely on the basis of our past experiences that we can call it dangerous.

So imagine a vector space where each dimension encodes an attribute such as edible, throwable, tasty, poisonous, kind, etc. Each object can be represented as a point in this attribute space. It is language that gives us categories as a shorthand for talking about commonly found objects. Different cultures would give rise to different ways of cutting up the world, and this is consistent with what psychologists have observed. Viewing categories as a way of compressing attribute vectors not only makes sense but is in agreement with the idea that categories arose culturally much later than the human ability to recognize objects. Thus it makes sense to think of category-free recognition. Since a robotic agent programmed to think of the world in terms of categories will have to unroll those categories into tangible properties in order to make sense of the world around it, why not use properties/attributes as the primary elements of recognition in the first place!?
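To make the attribute-space view concrete, here is a minimal sketch in Python. The attribute names, remembered objects, and numbers are all made up purely for illustration; nothing here is tied to any real system. Objects live as points in attribute space, perceptible attributes are used to find similar past experiences, and unobserved attributes are predicted from those neighbors.

```python
import numpy as np

# Hypothetical attribute dimensions and remembered objects -- illustration only.
ATTRIBUTES = ["edible", "dangerous", "throwable", "tasty", "heavy", "warm"]

memory = {
    "berry": np.array([0.9, 0.1, 0.8, 0.7, 0.0, 0.1]),
    "tiger": np.array([0.1, 1.0, 0.0, 0.0, 0.8, 0.9]),
    "rock":  np.array([0.0, 0.2, 0.9, 0.0, 0.7, 0.0]),
}

def recognize(observed, memory, k=2):
    """Recognition as attribute prediction: 'observed' holds the perceptible
    attributes of a new object (NaN where an attribute cannot be perceived).
    Unknown attributes are filled in by averaging the k most similar
    remembered objects, compared only over the perceptible dimensions."""
    known = ~np.isnan(observed)
    dists = {name: np.linalg.norm(vec[known] - observed[known])
             for name, vec in memory.items()}
    neighbors = sorted(dists, key=dists.get)[:k]
    completed = observed.copy()
    completed[~known] = np.mean([memory[n][~known] for n in neighbors], axis=0)
    return completed, neighbors

# A "category" is then just a convenient label for a cluster of nearby points
# in attribute space -- a compression of attribute vectors, not a primitive.
```

Under these made-up numbers, calling recognize(np.array([0.9, np.nan, np.nan, np.nan, 0.0, 0.1]), memory) would pull the new object towards its berry-like and rock-like neighbors and predict a low value for "dangerous" -- attributes are recovered without ever naming a category.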



These ideas are not entirely new. In Computer Vision, there is a CVPR 2009 paper, Describing objects by their attributes by Farhadi, Endres, Hoiem, and Forsyth (from UIUC), which strives to understand objects directly using the ideas discussed above. In the domain of thought recognition, the paper Zero-Shot Learning with Semantic Output Codes by Palatucci, Pomerleau, Hinton, and Mitchell strives to understand concepts using a similar semantic basis.

I believe the field of computer vision has become conceptually stuck, and the dogged reliance on rigid object categories is partially to blame. We should read more Wittgenstein and focus more on understanding vision as one component of artificial intelligence. If we keep playing the "recognize objects in a static image" game (as Computer Vision is doing!), we obtain a fragmented view of reality and cannot fully understand the relationship between recognition and intelligence.

Thursday, November 12, 2009

Learning and Inference in Vision: from Features to Scene Understanding


Tomorrow, Jonathan Huang and I are giving a Computer Vision tutorial at the First MLD (Machine Learning Department) Research Symposium at CMU. The title of our presentation is Learning and Inference in Vision: from Features to Scene Understanding.

The goal of the tutorial is to expose Machine Learning students to state-of-the-art object recognition, scene understanding, and the inference problems associated with such high-level recognition tasks. Our target audience is graduate students with little or no prior exposure to object recognition who would like to learn more about the use of probabilistic graphical models in Computer Vision. We outline the difficulties inherent in object recognition/detection and present several different models for jointly reasoning about multiple object hypotheses.

Saturday, November 07, 2009

A model of thought: The Associative Indexing of the Memex

The Memex, Vannevar Bush's "memory extender," is an organizational device, a conceptual device, and a framework for dealing with conceptual relationships in an associative way. Abandoning the Aristotelian tradition of rooting concepts in definitions, the Memex suggests an association-based, non-parametric, and data-driven representation of concepts.

Since the mind=software analogy is so deeply ingrained in my thinking, it is hard for me to see intelligent reasoning as anything but a computer program (albeit one which we might never discover/develop). It is worthwhile to see sketches of the Memex from an era before computers (see the figure below). However, with the modern Internet, a magnificent example of Bush's ideology, with links denoting the associations between pages, we need no better analogy. Bush's critique of the artificiality of traditional schemes of indexing resonates throughout the world wide web.


A Mechanical Memex Sketch

By extrapolating Bush's anti-indexing argument to visual object recognition, I realize that the blunder is to assign concepts to rigid categories. The desire to break free from categorization was the chief motivation for my Visual Memex paper. If Bush's ideas were so successful in predicting the modern Internet, we should ask ourselves, "Why are categories so prevalent in computational models of perception?" Maybe it is machine learning, with its own tradition of classes in supervised learning approaches, that has scarred the way we computer scientists see reality.

“The human mind does not work that way. It operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate web of trails carried by the cells of the brain. It has other characteristics, of course; trails that are not frequently followed are prone to fade, items are not fully permanent, memory is transitory. Yet the speed of action, the intricacy of trails, the detail of mental pictures, is awe-inspiring beyond all else in nature.” -- Vannevar Bush

Is Google's grasp of the world of information anything more than a Memex? I'm not convinced that it is. While the feat of searching billions of web pages in real time has already been demonstrated by Google (and is reinforced every day), the best computer vision approaches of today look nothing like Google's data-driven way of representing concepts. I'm quite interested in pushing this link-based, data-driven mentality to the next level in the field of Computer Vision. Breaking free from the categorization assumptions that plague computational perception might be the key ingredient in the recipe for success.

Instead of summarizing, here is another link to a well-written article on the Memex by Finn Brunton. Quoting Brunton, "The deepest implications of the Memex would begin to become apparent here: not the speed of retrieval, or even the association as such, but the fact that the association is arbitrary and can be shared, which begins to suggest that, at some level, the data itself is also arbitrary within the context of the Memex; that it may not be “the shape of thought,” emphasis on the the, but that it is the shape of a new thought, a mediated and mechanized thought, one that is described by queries and above all by links."

Thursday, November 05, 2009

The Visual Memex: Visual Object Recognition Without Categories


Figure 1
I have discussed the limitations of using rigid object categories in computer vision, and my CVPR 2008 work on Recognition as Association was a move towards developing a category-free model of objects. There I was primarily concerned with local object recognition, where recognition was driven by the appearance/shape/texture features computed within a segment (a region extracted from an image by a segmentation algorithm). Recognition was done locally and independently per region, since I did not have a good model of category-free context at the time. I've given the problem of contextual object reasoning much thought over the past several years, and equipped with the power of graphical models and learning algorithms, I now present a model for category-free reasoning about object relationships.

Now it's 2009, and it's no surprise that I have a paper on context. Context is the new beast and all the cool kids are using it for scene understanding; however, categories are used so often for this problem that their use is rarely questioned. In my NIPS 2009 paper, I present a category-free model of object relationships and address the problem of context-only recognition, where the goal is to recognize an object based solely on contextual cues. Figure 1 shows an example of such a prediction task: given K objects and their spatial configuration, is it possible to predict the appearance of a hidden object at some spatial location?
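As a rough illustration of what such a context-only prediction task involves (this is my own schematic rendering, not the model from the paper, and all names are hypothetical), one can think of each visible object as casting a vote for what should appear at the hidden location, based on its appearance and its 2D spatial relationship to that location:

```python
import numpy as np

def relative_layout(box_a, box_b):
    """Encode 2D spatial context between two boxes (x, y, width, height):
    normalized offset plus log scale ratios."""
    (xa, ya, wa, ha), (xb, yb, wb, hb) = box_a, box_b
    return np.array([(xb - xa) / wa, (yb - ya) / ha,
                     np.log(wb / wa), np.log(hb / ha)])

def context_only_score(visible_objects, hidden_box, compatibility):
    """visible_objects: list of (appearance_vector, box) pairs for the K known objects.
    hidden_box: the spatial location of the object to be predicted.
    compatibility: any function scoring how plausible a candidate appearance is
    next to a given visible object under their relative layout.
    Returns a scoring function over candidate appearances for the hidden object."""
    def score(candidate_appearance):
        return sum(compatibility(appearance, candidate_appearance,
                                 relative_layout(box, hidden_box))
                   for appearance, box in visible_objects)
    return score
```

The interesting question is what the compatibility function looks like when no category labels are available, which is exactly where the Visual Memex comes in.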

Figure 2


I present a model called the Visual Memex (visualized in Figure 2), which is a non-parametric, graph-based model of visual concepts and their interactions. Unlike traditional approaches to object-object modeling, which learn potentials between every pair of categories (the number of such pairs scales quadratically with the number of categories), I make no category assumptions when modeling context.
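As a data structure, the Visual Memex can be pictured roughly as follows. This is only a schematic sketch of the graph described in the paper (exemplar nodes, visual similarity edges, and object-object context edges); the field names and the spatial encoding are my own assumptions for illustration:

```python
from collections import defaultdict

class VisualMemexGraph:
    """Schematic sketch: a non-parametric, category-free graph of object exemplars.
    Nodes are individual exemplars; edges come in two flavors, visual similarity
    and spatial context, so pairwise relationships are stored per exemplar pair
    rather than per category pair."""

    def __init__(self):
        self.exemplars = {}                        # exemplar_id -> appearance features
        self.similarity_edges = defaultdict(set)   # exemplar_id -> visually similar exemplar ids
        self.context_edges = defaultdict(list)     # exemplar_id -> (other_id, 2D spatial relationship)

    def add_exemplar(self, exemplar_id, features):
        self.exemplars[exemplar_id] = features

    def add_similarity_edge(self, a, b):
        self.similarity_edges[a].add(b)
        self.similarity_edges[b].add(a)

    def add_context_edge(self, a, b, spatial_relationship):
        # Two exemplars observed together in the same scene, with their relative layout.
        self.context_edges[a].append((b, spatial_relationship))
        self.context_edges[b].append((a, spatial_relationship))
```

Because relationships are attached to exemplars rather than to category pairs, the graph grows with the data rather than with the square of a fixed category vocabulary.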

The official paper is out, and can be found on my project page:

Tomasz Malisiewicz, Alexei A. Efros. Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships. In NIPS, December 2009. PDF

Abstract: The use of context is critical for scene understanding in computer vision, where the recognition of an object is driven by both local appearance and the object's relationship to other elements of the scene (context). Most current approaches rely on modeling the relationships between object categories as a source of context. In this paper we seek to move beyond categories to provide a richer appearance-based model of context. We present an exemplar-based model of objects and their relationships, the Visual Memex, that encodes both local appearance and 2D spatial context between object instances. We evaluate our model on Torralba's proposed Context Challenge against a baseline category-based system. Our experiments suggest that moving beyond categories for context modeling appears to be quite beneficial, and may be the critical missing ingredient in scene understanding systems.

I gave a talk about my work yesterday at CMU's Misc-read and received some good feedback. I'll be at NIPS this December representing this body of research.