Tuesday, November 24, 2009

Understanding the role of categories in object recognition

If we set aside our academic endeavors of publishing computer vision and machine learning papers and sincerely ask ourselves, "What is the purpose of recognition?" a very different story emerges.

Let me first outline the contemporary stance on recognition (that is, object recognition as embraced by the computer vision community), which is actually a bit of a "non-stance" because many people working on recognition haven't bothered to understand the motivations, implications, and philosophical foundations of their work. The standard view is that recognition is equivalent to categorization -- assigning an object its "correct" category is the goal of recognition. Object recognition, as found in vision papers, is commonly presented as a single-image recognition task that is not tied to an active, mobile agent that must understand and act in the environment around it. These contrived tasks are partially to blame for making us think that categories are the ultimate truth. Of course, once we've pinpointed the correct category we can look up information about the object category at hand in some sort of grand encyclopedia. For example, once we've categorized an object as a bird, we can simply recall the fact that "it flies" from such a source of knowledge.

Most object recognition research is concerned with object representations (what features to compute from an image) as well as supervised (and semi-supervised) machine learning techniques for learning object models from data in order to discriminate, and thus "recognize", object categories. The reason object recognition has become so popular in the past decade is that many researchers in AI/Robotics envision a successful vision system as a key component of any real-world robotic platform. If you ask a human to describe their environment, they will probably use a bunch of nouns to enumerate the stuff around them, so surely nouns must be the basic building blocks of reality! In this post I want to question this commonsense assumption that categories are the building blocks of reality and propose a different way of coping with reality, one that doesn't try to directly estimate a category from visual data.

I argue that just because nouns (and the categories they refer to) are the basis of effability for humans, it doesn't mean that nouns and categories are the quarks and gluons of recognition. Language is a relatively recent phenomenon for humans (think on an evolutionary scale here), and it is absent in many animals inhabiting the earth alongside us. It is absurd to think that animals do not possess a faculty for recognition just because they do not have a language. Since animals can quite effectively cope with the world around them, there must be hope for understanding recognition in a way that doesn't invoke linguistic concepts.

Let me make my first disclaimer. I am not against categories altogether -- they have their place. The goal of language is human-human communication, and intelligent robotic agents will inevitably have to map their internal modes of representation onto human language if we are to understand and deal with such artificial beings. I just want to criticize the idea that categories are found deep within our (human) neural architecture and serve as the basis for recognition.


Imagine a caveman and his daily life, which requires quite a bit of "recognition" ability to cope with the world around him. He must differentiate pernicious animals from edible ones, distinguish contentious cavefolk from his peaceful companions, and reason about the plethora of plants around him. For each object that he recognizes, he must be able to determine whether it is edible, dangerous, poisonous, tasty, heavy, warm, etc. In short, recognition amounts to predicting a set of attributes associated with an object. Recognition is the linking of perceptible attributes (it is green and the size of my fist) to our past experiences, and the prediction of attributes that are not conveyed by mere appearance. If we see a tiger, it is solely on the basis of our past experiences that we can call it dangerous.

So imagine a vector space where each dimension encodes an attribute such as edible, throwable, tasty, poisonous, kind, etc. Each object can be represented as a point in this attribute space. It is language that gives us categories as a shorthand for talking about commonly found objects. Different cultures would cut up the world in different ways, and this is consistent with what psychologists have observed. Viewing categories as a way of compressing attribute vectors not only makes sense but is in agreement with the idea that categories arose culturally much later than the human ability to recognize objects. Thus it makes sense to think of category-free recognition. Since a robotic agent programmed to think of the world in terms of categories will have to unroll those categories into tangible properties in order to make sense of the world around it, why not use properties/attributes as the primary elements of recognition in the first place?
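To make this concrete, here is a toy sketch of the idea in Python; the attributes, objects, and numbers are all invented for illustration and are not meant as a serious model:

```python
# Each object is a point in a vector space whose axes are attributes; a
# "category" is just a convenient summary (prototype) of nearby points.
import numpy as np

ATTRIBUTES = ["edible", "dangerous", "throwable", "furry", "heavy"]

# Hypothetical attribute values in [0, 1], made up for illustration.
objects = {
    "apple": np.array([1.0, 0.0, 1.0, 0.0, 0.0]),
    "berry": np.array([0.9, 0.1, 1.0, 0.0, 0.0]),
    "tiger": np.array([0.0, 1.0, 0.0, 1.0, 1.0]),
    "wolf":  np.array([0.0, 0.9, 0.0, 1.0, 0.7]),
}

def describe(vec, threshold=0.5):
    """Read off the attributes the agent actually needs to act on."""
    return [a for a, v in zip(ATTRIBUTES, vec) if v >= threshold]

# A "category" emerges as a prototype: the mean attribute vector of objects
# a culture happens to group together under one noun.
prototypes = {
    "fruit":    np.mean([objects["apple"], objects["berry"]], axis=0),
    "predator": np.mean([objects["tiger"], objects["wolf"]], axis=0),
}

new_object = np.array([0.1, 0.8, 0.1, 0.9, 0.8])   # something just observed
print(describe(new_object))                         # ['dangerous', 'furry', 'heavy']

# Naming it is just finding the nearest prototype -- a lossy shorthand for
# the attribute vector, not the primary representation.
name = min(prototypes, key=lambda c: np.linalg.norm(prototypes[c] - new_object))
print(name)                                         # 'predator'
```

The agent can act on the attribute vector directly; the category name is only a compressed summary of it.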



These ideas are not entirely new. In Computer Vision, there is a CVPR 2009 paper, Describing objects by their attributes by Farhadi, Endres, Hoiem, and Forsyth (from UIUC), which strives to understand objects directly using the ideas discussed above. In the domain of thought recognition, the paper Zero-Shot Learning with Semantic Output Codes by Palatucci, Pomerleau, Hinton, and Mitchell strives to understand concepts using a similar semantic basis.
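For the zero-shot flavor of this idea, here is a rough numpy sketch; the attribute signatures, the toy feature model, and the class names are all invented for illustration and are not taken from either paper:

```python
# Zero-shot recognition via attributes: learn to predict attributes from image
# features, then name an unseen class by matching the predicted attribute
# vector against known class "signatures" -- no training images of that class.
import numpy as np

rng = np.random.default_rng(0)
ATTRS = ["furry", "has_legs", "has_wings", "metallic"]

# Hypothetical per-class attribute signatures (the semantic code of each class).
signatures = {
    "dog":   np.array([1., 1., 0., 0.]),
    "bird":  np.array([0., 1., 1., 0.]),
    "plane": np.array([0., 0., 1., 1.]),   # held out: no training images at all
}

# Toy world: "image features" are a fixed (but unknown to the learner) linear
# function of the true attributes, plus noise.
W_true = rng.standard_normal((6, 4))
def render(attr_vec):
    return attr_vec @ W_true.T + 0.1 * rng.standard_normal((1, 6))

# Training set: instances with assorted attribute combinations, none of which
# match the held-out "plane" signature.
A_train = rng.integers(0, 2, size=(300, 4)).astype(float)
A_train = A_train[~np.all(A_train == signatures["plane"], axis=1)]
X_train = np.vstack([render(a) for a in A_train])

# Learn a linear attribute predictor (least squares: features -> attributes).
W_hat, *_ = np.linalg.lstsq(X_train, A_train, rcond=None)

# Test time: an image of a plane, a class never seen during training ...
a_pred = (render(signatures["plane"]) @ W_hat).ravel()
# ... gets named by the nearest attribute signature.
guess = min(signatures, key=lambda c: np.linalg.norm(signatures[c] - a_pred))
print(dict(zip(ATTRS, np.round(a_pred, 2))), "->", guess)   # expect 'plane'
```

The category label plays no role until the very last step, and even then it is read off the attribute space rather than predicted directly from pixels.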

I believe the field of computer vision has been conceptually stuck, and the vehement reliance on rigid object categories is partially to blame. We should read more Wittgenstein and focus more on understanding vision as a mere component of artificial intelligence. If we keep playing the "recognize objects in a static image" game (as Computer Vision is doing!), then we obtain a fragmented view of reality and cannot fully understand the relationship between recognition and intelligence.

16 comments:

  1. Anonymous 5:29 AM

    Attribute-based recognition seems like a fantastic idea. But it seems that attributes are also categories. Tail is a category, right? The same for leg.

  2. I would argue that attribute-based recognition is quite different from category-based recognition, but, as you point out, attributes do seem to have categories of their own. The chief difference is that in a category-based view the world of objects is cut up into rigid categories, while in an attribute-based view each object is more flexibly represented as a point in an attribute vector space.

    In the paper "Describing objects by their attributes" some of the attributes chose are indeed very similar to object categories. It's not clear to me what the attributes should be, if they should be binary, discrete, or continuous. However, I think the attributes should capture non-visual properties of objects such as "is edible" and "is graspable with one hand" because only by understanding objects in terms of these concepts will a robotic agent have true understanding of the world around them. In summary, attributes should be tied to the sense modalities and motor/action capabilities of the agent using vision as a means to an end and not be purely visual "is furry" or part based "has tail" and "has leg".

  3. Anonymous 12:17 AM

    This idea of property/attribute-based recognition also appears in some early papers of Whitman Richards, e.g. "How To Play Twenty Questions With Nature and Win".

  4. Thanks for the Whitman Richards reference. The 20 questions game is definitely relevant.

  5. Just in case you did not look at this: http://www.idiap.ch/~bcaputo/icvw09.html

  6. Phong 8:34 AM

    Your idea seems fancy, but I think object categorization is still an inevitable component of an intelligent system. An object's attributes cannot be learned from just one image, so we have to prepare a database of knowledge about them. Once an object is recognized correctly, an inference engine will pull out the right attributes from the database to "annotate" that object. What happens if a zebra is recognized as a tiger? Basically, object categorization is indeed the right goal to pursue.

  7. Tomasz -- another recent attributes paper from UIUC:

    Joint learning of visual attributes, object classes and visual saliency, ICCV 2009.

    and a specific one for faces:
    Attribute and simile classifiers for face verification, ICCV 2009.


    It is true that it isn't always clear what is an attribute and what is an object (e.g. "has tail"), but that shouldn't lead us to conclude that attributes are not a useful (or even the most fundamental) approach for vision.

  8. Santosh -- Thanks for the link. The ideas advocated in this workshop are definitely up my alley. This is the type of venue that would most likely agree with the discussions/questions from my blog.

    Phong -- I agree that object attributes cannot be learned from one image. Learning object categories or object attributes from a single image is ill-posed and very difficult. I agree that once categorization has been performed, it is easy to pull out the right attributes from a large database which maintains the attributes of a given category. What I wanted to argue in this blog post is that if a robotic agent is to make sense of AND interact with the world, it will have to reason about objects in terms of attributes. Thus, if the attributes are somehow more important than object categories, perhaps we should ask ourselves whether we can predict attributes without estimating categories first.

  9. Andrew -- thanks for the pointers. I've seen these papers before but haven't gotten around to reading them in detail.

    I definitely agree that attributes haven't gotten the attention they deserve in the field of computer vision. Whether they are "more fundamental" than object categories I'm not sure yet, but I definitely think it is a worthwhile discussion. Unfortunately, the way the computer vision community evaluates progress in object recognition is biased towards categories. I surmise that when the community seriously considers embodied vision systems and leaves behind non-interactive vision, we will see a significantly stronger interest in property/attribute-based recognition.

  10. Anonymous 10:07 AM

    It might even be that recognition isn't required at all for attributes. If an agent can get some idea of the 3d shape of the object, then attributes can be computed directly from that. I'm thinking of attributes like "can grab", "can throw", "can sit on".

    I mean, there's no need to relate the object to things, or examples, in the agent's memory to get at such attributes.

  11. Dear Anonymous,

    I want to comment on your idea of computing attributes directly from 3d shape information *without* recognition. First of all, computing 3d shape information from a static image is no easy task. When we're talking about an agent, we may have access to 2.5D imagery (from stereo or laser rangefinding), but we'll still have to segment out the object from the background. If the object is stationary on a stationary background (think of a phone on top of a stack of yellow pages books, which is on top of a desk), then the agent will have to use prior experience to segment out the object. Without using information about what phones/books/tables look like (and/or their 3d shape), how will the agent know how many objects there are? What will prevent the agent from thinking that it is seeing a single book-o-phone object instead of two distinct objects?

    In short, I doubt any sort of coping with the environment (whether attribute-based or category-based) is possible without resorting to the wealth of past experience that an agent must possess. Whether past experience is stored in a memory-like system or as abstract category-based models is still open to debate.

    I do believe that in some highly controlled environments, reasoning about attributes directly from perception (Gibson-style "direct perception") is possible, but not in real natural environments. What this means is that such simple and silly environments (think of a child playing with a single toy in a clutter-free playpen) can be useful during training, but the ultimate test will have to be done in a more adversarial setting.

  12. The concept of categorization could be applied in another field, namely document search and machine translation. What I have in mind is tagging words in a document. First, proper names could be tagged to separate Baker the name from baker the occupation. Second, parts of speech could be coded to avoid the "time flies" ambiguity. Third, present and past tense could be marked for words like "put", which use the same form for both. Finally, the big one would be to identify multiple meanings of the same word.

    I realize this would be an enormous task that could only be partially automated. Might it be worth it?

  13. Anonymous 5:09 PM

    On a related note, you may be interested in the following recent edited volume:

    Object Categorization: Computer and Human Vision Perspectives by Sven J. Dickinson, Ales Leonardis, Bernt Schiele, and Michael J. Tarr (Hardcover - Sep 7, 2009)

    Dickinson especially has a nice historical survey of object categorization in computer vision.

  14. Right now I'm drinking coffee with a nice red copy of this Categorization book next to me; I'm on page 56, reading Perona's essay. I have read most of the chapters, and I agree that Dickinson's treatise is excellent. I have commented on Edelman's and Bar's chapters before on my blog (back when those articles weren't yet part of this book).

  15. This attribute-based approach reminds me of Python's duck typing approach to classes: "when I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck."

    It's interesting.

    Working on knowledge processing (ontologies et cetera), I should also add that the border between classes and attributes is relatively thin. Very often, classes (i.e. categories) are actually defined in terms of the attributes a certain object has. It goes both ways: an instance that has the right attributes can be inferred to belong to some class; an instance of a class inherits all the attributes of that class.

  16. The analogy between typing in programming languages and the use of categories in knowledge representation is very interesting.

    A statically typed language such as C++ requires each variable to belong to some type (whether built-in, such as int, or user-defined), and this helps catch bugs as well as produce faster compiled programs. Dynamically typed languages allow a variable to hold an integer one second and an array of strings the next. The interpretation of variables in such a class-free system is context dependent.

    After reading a bit about duck typing, it does seem very relevant to my post about categories in vision. In fact, the statement about ducks says something deep about human understanding and is relevant to empiricism. A toy sketch of the contrast appears below.

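    Here is a minimal sketch of that contrast in Python; the classes and functions are invented for illustration and are not from any of the papers discussed above:

    ```python
    # Toy illustration: category-based recognition asks "which class is this?",
    # while duck typing / attribute-based recognition asks "what can it do?"

    class Duck:
        def quack(self): return "quack"
        def swim(self): return "paddling"

    class RobotDecoy:                       # not a Duck, but acts like one
        def quack(self): return "quack (recorded)"
        def swim(self): return "propeller on"

    def category_based(obj):
        # Rigid: only the "correct" category is accepted.
        if isinstance(obj, Duck):
            return obj.quack()
        raise TypeError("not a duck")

    def attribute_based(obj):
        # Flexible: anything exhibiting the right attributes is treated alike.
        if hasattr(obj, "quack") and hasattr(obj, "swim"):
            return obj.quack()
        raise TypeError("neither quacks nor swims")

    print(attribute_based(Duck()))          # -> quack
    print(attribute_based(RobotDecoy()))    # -> quack (recorded)
    try:
        print(category_based(RobotDecoy()))
    except TypeError as err:
        print("category-based view rejects it:", err)
    ```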