
Friday, December 06, 2013

Brand Spankin' New Vision Papers from ICCV 2013

The International Conference on Computer Vision, ICCV, gathers the world's best researchers in Computer Vision and Machine Learning to showcase their newest and hottest ideas. (My work on the Exemplar-SVM debuted two years ago at ICCV 2011 in Barcelona.) This year, at ICCV 2013 in Sydney, Australia, the vision community witnessed lots of grand new ideas and excellent presentations, and gained new insights which are likely to influence the direction of vision research in the upcoming decade.


3D data is everywhere.  Detectors are not only getting faster, but getting stylish.  Edges are making a comeback.  HOGgles let you see the world through the eyes of an algorithm. Computers can automatically make your face pictures more memorable. And why ever stop learning, when you can learn all day long?

Here is a breakdown of some of the must-read ICCV 2013 papers which I'd like to share with you:


From Large Scale Image Categorization to Entry-Level Categories. Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L. Berg, ICCV 2013.

This paper is the Marr Prize winning paper from this year's conference.  It is all about entry-level categories - the labels people will use to name an object - which were originally defined and studied by psychologists in the 1980s. In the ICCV paper, the authors study entry-level categories at a large scale and learn the first models for predicting entry-level categories for images. The authors learn mappings between concepts predicted by existing visual recognition systems and entry-level concepts that could be useful for improving human-focused applications such as natural language image description or retrieval. NOTE: If you haven't read Eleanor Rosch's seminal 1978 paper, The Principles of Categorization, do yourself a favor: grab a tall coffee, read it and prepare to be rocked.


Structured Forests for Fast Edge Detection, P. Dollar and C. L. Zitnick, ICCV 2013.

This paper from Microsoft Research is all about pushing the boundaries for edge detection. Randomized Decision Trees and Forests have been used in lots of excellent Microsoft Research papers, with Jamie Shotton's Kinect work being one of the best examples, and they are now being used for super high-speed edge detection.  However, this paper is not just about edges.  Quoting the authors, "We describe a general purpose method for learning structured random decision forest that robustly uses structured labels to select splits in the trees."  Anybody serious about learning for low-level vision should take a look.

There is also some code available, but take a very detailed look at the license before you use it in your project.  It is not an MIT license.


HOGgles: Visualizing Object Detection Features, C. Vondrick, A. Khosla, T. Malisiewicz, A. Torralba. ICCV 2013.


"The real voyage of discovery consists not in seeking new landscapes but in having new eyes." — Marcel Proust

This is our MIT paper, which I already blogged about (Can you pass the HOGgles test?), so instead of rehashing what was already mentioned, I'll just leave you with the quote above.  There are lots of great visualizations that Carl Vondrick put together on the HOGgles project webpage, so take a look.


Style-aware Mid-level Representation for Discovering Visual Connections in Space and Time. Yong Jae Lee, Alexei A. Efros, and Martial Hebert, ICCV 2013.


“Learn how to see. Realize that everything connects to everything else.” – Leonardo da Vinci

This paper is all about discovering how visual entities change as a function of time and space.  One great example is how the appearance of cars has changed over the past several decades.  Another example is how typical Google Street View images change as a function of going North-to-South in the United States.  Surely the North looks different than the South -- we now have an algorithm that can automatically discover these precise differences.

By the way, congratulations on the move to Berkeley, Monsieur Efros.  I hope your insatiable thirst for cultured life will not only be satisfied in the city which fostered your intellectual growth, but you will continue to inspire, educate, and motivate the next generation of visionaries.




NEIL: Extracting Visual Knowledge from Web Data. Xinlei Chen, Abhinav Shrivastava and Abhinav Gupta. In ICCV 2013. www.neil-kb.com




Fucking awesome! I don't normally use profanity on my blog, but I couldn't come up with a better phrase to describe the ideas presented in this paper.  A computer program which runs 24/7 to collect visual data from the internet and continually learn what the world is all about.  This is machine learning, this is AI, this is the future.  None of this train-on-my-favourite-dataset, test-on-my-favourite-dataset bullshit.  If there's anybody that's going to do it the right way, it's the CMU gang.  This paper gets my unofficial "Vision Award." Congratulations, Xinlei!



This sort of never-ending learning has been applied to text by Tom Mitchell's group (also from CMU), but this is the first serious attempt at never-ending visual learning.  The underlying algorithm is a semi-supervised learning algorithm which uses Google Image Search to bootstrap the initial detectors, but eventually learns object-object relationships, object-attribute relationships, and scene-attribute relationships.
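To make the bootstrapping idea concrete, here is a toy self-training loop in the same spirit. This is my own sketch, not NEIL's actual pipeline: the "car"/"dog" points are synthetic stand-ins for image features, and a nearest-centroid classifier stands in for the detectors. A few "image search" seeds per concept train an initial model, and only the most confidently classified unlabeled examples get folded back into the training set each round:

```python
import numpy as np

# Toy self-training loop in the spirit of bootstrapped visual learning
# (my own sketch, not NEIL's pipeline; "car"/"dog" points are synthetic
# stand-ins for image features returned by a web search).
rng = np.random.default_rng(0)
pool = np.vstack([rng.normal([2, 2], 0.5, (50, 2)),     # concept A
                  rng.normal([-2, -2], 0.5, (50, 2))])  # concept B

# A few "Google Image Search" seeds per concept bootstrap the detectors.
labeled = {"car": list(pool[:3]), "dog": list(pool[50:53])}
unlabeled = list(range(3, 50)) + list(range(53, 100))

for _ in range(10):                       # "never-ending" loop, truncated
    centroids = {k: np.mean(v, axis=0) for k, v in labeled.items()}
    scored = []
    for i in unlabeled:
        d = {k: np.linalg.norm(pool[i] - c) for k, c in centroids.items()}
        best = min(d, key=d.get)
        scored.append((abs(d["car"] - d["dog"]), i, best))  # confidence
    scored.sort(reverse=True)
    for _, i, best in scored[:10]:        # accept only the most confident
        labeled[best].append(pool[i])
        unlabeled.remove(i)

print({k: len(v) for k, v in labeled.items()})
```

The real system goes much further, of course: it mines relationships between the concepts as it goes, rather than merely growing each concept's training set.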



Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition. J. F. Henriques, J. Carreira, R. Caseiro, J. Batista. ICCV 2013.

Want faster detectors? Tired of hard-negative mining? Love all things Fourier?  Then this paper is for you.  Aren't you now glad you fell in love with linear algebra at a young age? This paper very clearly shows that there is a better way to perform hard-negative mining when the negatives are mined from translations of an underlying image pattern, as is typically done in object detection.  The basic idea is simple, and that's why this paper wins the "thumbs-up from tombone" award. The crux of the derivation in the paper is the observation that the Gram matrix of a set of images and their translated versions, as modeled by cyclic shifts, exhibits a block-circulant structure.  Instead of incrementally mining negatives, in this paper they show that it is possible to learn directly from a training set comprising all image subwindows of a predetermined aspect-ratio and show this is feasible for a rich set of popular models including Ridge Regression, Support Vector Regression (SVR) and Logistic Regression.  Move over hard-negative mining, Joseph Fourier just rocked your world.
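The circulant observation is easy to verify numerically. In this little illustration of mine (a 1-D signal stands in for an image window; the real multi-channel HOG case is block-circulant, one block per feature channel), the Gram matrix of all cyclic shifts of a pattern comes out circulant, and the DFT diagonalizes it — which is exactly what lets you train against every translated window without ever enumerating them:

```python
import numpy as np

# A toy 1-D pattern; its cyclic shifts stand in for the translated
# negative windows that hard-negative mining would normally enumerate.
x = np.array([1.0, 2.0, 3.0, 4.0, 0.0, -1.0])
n = len(x)
shifts = np.stack([np.roll(x, k) for k in range(n)])

# Gram matrix of the shifted set: G[i, j] = <shift_i(x), shift_j(x)>
# depends only on (i - j) mod n, so G is circulant.
G = shifts @ shifts.T
assert np.allclose(G[1], np.roll(G[0], 1))

# Circulant matrices are diagonalized by the DFT: learning against ALL
# shifts reduces to cheap independent per-frequency problems.
F = np.fft.fft(np.eye(n))              # unnormalized DFT matrix
eigs = np.fft.fft(G[0])                # eigenvalues = DFT of first row
assert np.allclose(G, (np.linalg.inv(F) @ np.diag(eigs) @ F).real)
```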

P.S. Joao Carreira also created the CPMC image segmentation algorithm at CVPR 2010.  A recent blog post from Piotr Dollár (December 10th, 2013), "A Seismic Shift in Object Detection" discusses how segmentation is coming back into vision in a big way.


3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding, Scott Satkin and Martial Hebert. ICCV 2013.

A new way of matching images that come equipped with 3D data.  Whether the data comes from Google Sketchup, or is the output of a Kinect-like scanner, more and more visual data comes with its own 3D interpretation.  Unfortunately, most state-of-the-art image matching methods rely on comparing purely visual cues.  This paper is based on an idea called "fine-grained geometry refinement" and allows the transfer of information across extreme viewpoint changes.  While still computationally expensive, it allows non-parametric (i.e., data-driven) approaches to get away with using significantly smaller amounts of data.


Modifying the Memorability of Face Photographs.  Aditya Khosla, Wilma A. Bainbridge, Antonio Torralba and Aude Oliva, ICCV 2013.

Ever wanted to look more memorable in your photos?  Maybe your ad campaign could benefit from better face pictures which are more likely to stick in people's minds.  Well, now there's an algorithm for that.  Another great MIT paper, in which the authors show that the memorability of photographs can not only be measured, but automatically enhanced!


SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels. J. Xiao, A. Owens and A. Torralba. ICCV 2013. sun3d.cs.princeton.edu



Xiao et al. continue their hard-core data collection efforts.  Now in 3D.  In addition to collecting a vast dataset of 3D reconstructed scenes, they show that there are some kinds of errors that simply cannot be overcome with high-quality solvers.  Some problems are too big and too ambitious (e.g., walking around an entire house with a Kinect) for even the best industrial-grade solvers (such as Google's Ceres solver) to tackle.  In this paper, they show that a small amount of human annotation is all it takes to snap those reconstructions into place.  And not via any sort of crazy click-here, click-there interface: simple LabelMe-like annotation interfaces, which require annotating object polygons, can be used to create additional object-object constraints which help the solvers do their magic.  Anybody interested in long-range scene reconstruction should take a look at their paper.

If there's one person I've ever seen that collects data while the rest of the world sleeps, it is definitely Prof. Xiao.  Congratulations on the new faculty position!  Princeton has been starving for a person like you.  If anybody is looking for PhD/Masters/postdoc positions, and wants to work alongside one of the most ambitious and driven upcoming researchers in vision (Prof. Xiao), take a look at his disclaimer/call for students/postdocs at Princeton, then apply to the program directly.  Did I mention that you probably have to be a hacker/scientist badass to land a position in his lab?

Other noteworthy papers:

Mining Multiple Queries for Image Retrieval: On-the-fly learning of an Object-specific Mid-level Representation. B. Fernando, T. Tuytelaars,  ICCV 2013.

Training Deformable Part Models with Decorrelated Features. R. Girshick, J. Malik, ICCV 2013.

Sorry if I missed your paper, there were just too many good ones to list.  For those of you still in Sydney, be sure to either take a picture of a Kangaroo, or eat one.

Thursday, March 08, 2012

"I shot the cat with my proton gun."


I often listen to lectures and audiobooks when I drive more than 2 hours because I don't always have the privilege of enjoying a good conversation with a passenger.  Recently I was listening to some philosophy of science podcasts on my iPhone while driving from Boston to New York when the following sentence popped into my head:

"I shot the cat with my proton gun."


I had just listened to three separate podcasts (one about Kant, one about Wittgenstein and one about Popper) when the sentence came to my mind.  What is so interesting about this sentence is that while it is effortless to grasp, it uses two different types of concepts in a single sentence: a "proton gun" and a "cat."  It is a perfectly normal sentence, and the above illustration describes the sentence fairly well (photo credits to http://afashionloaf.blogspot.com/2010/03/cat-nap-mares.html for the kitty, and http://www.colemanzone.com/ for the proton gun).

Cat == an "everyday" empirical concept
"Cat" is an everyday "empirical" concept, a concept with which most people have first-hand experience (i.e., empirical knowledge).  It is commonly believed that such everyday concepts are acquired by children at a young age -- it is an example of a basic-level concept, the kind which people like Immanuel Kant and Ludwig Wittgenstein discuss at great length.  We do not need a theory of cats for the idea of a cat to stick.





Proton Gun == a "scientific" theoretical concept
On the other extreme is the "proton gun." It is an example of a theoretical concept -- a type of concept which rests upon classroom (i.e., "scientific") knowledge.  The idea of a proton gun is akin to the idea of Pluto, an esophagus, or cancer -- we do not directly observe such entities; we learn about them from books and by seeing illustrations such as the one below.  Such theoretical constructs are the entities which Karl Popper and the Logical Positivists would often discuss.


While many of us have never seen a proton (nor a proton gun), it is a perfectly valid concept to invoke in my sentence.  If you have a scientific background, then you have probably seen so many artistic renditions of protons (see Figure below) and spent so many endless nights studying for chemistry and physics exams that the word proton conjures a mental image.  It is hard for me to think of entities which trigger mental imagery as non-empirical.

How do we learn such concepts?  The proton gun comes from scientific education!  The cat comes from experience!  But since the origins of the concept "proton" and the concept "cat" are so disjoint, our (human) mind/brain must be more-amazing-than-previously-thought because we have no problem mixing such concepts in a single clause.  It does not feel like these two different types of concepts are stored in different parts of the brain.

The idea which I would like you, the reader, to entertain over the next minute or so is the following:

Perhaps the line between ordinary "empirical" concepts and complex "theoretical" concepts is an imaginary boundary -- a boundary which has done more harm than good.  

One useful thing I learned from the Philosophy of Science is that it is worthwhile to doubt the existence of theoretical entities.  Not for iconoclastic ideals, but for the advancement of science!  Descartes' hyperbolic doubt is not dead.  Another useful thing to keep in mind is Wittgenstein's Philosophical Investigations and his account of the acquisition of knowledge.  Wittgenstein argued elegantly that "everyday" concepts are far from easy to define (see his family-resemblances argument and the argument on defining a "game").  Kant, with his transcendental aesthetic, has taught me to question a hardcore empiricist account of knowledge.

So then, as good cognitive scientists, researchers, and pioneers in artificial intelligence, we must also doubt the rigidity of those everyday concepts which appear to us so ordinary. If we want to build intelligent machines, then we must be ready to break down our own understanding of reality, and not be afraid to question things which appear unquestionable.

In conclusion, if you find popular-culture references more palatable than my philosophical pseudo-science mumbo-jumbo, then let me leave you with two inspirational quotes.  First, let's not forget Pink Floyd's lyrics, which argued against the rigidity of formal education: "We don't need no education, We don't need no thought control." And the second, a misunderstood yet witty aphorism, comes to us from Dr. Timothy Leary and reminds us that there is a time for education and a time for reflection.  In his own words:  "Turn on, tune in, drop out."

Sunday, June 13, 2010

everything is misc -- torralba cvpr paper to check out

Weinberger's Everything is Miscellaneous is a delightful read -- I just finished it today while flying from PIT to SFO.  It was recommended to me by my PhD advisor, Alyosha, and now I can see why!  Many of the key motivations behind my current research on object representation deeply resonate in Weinberger's book.

Weinberger motivates Rosch's theory of categorization (the Prototype Model), and explains how it is a significant break from thousands of years of Aristotelian thought.  Aristotle gave us the notion of a category -- centered around the notion of a definition.  For Aristotle, every object can be stripped down to its essential core and placed in its proper place in a God's-eye objective organization of the world.  It was Rosch who showed us that categories are much fuzzier and more hectic than the rigid Aristotelian system suggests. Just as Copernicus single-handedly stopped the Sun and set the Earth in motion, Rosch disintegrated our neatly organized world-view and demonstrated how an individual's path through life shapes their concepts.

I think it is fair to say that my own ideas, as well as Weinberger's, aren't so much an extension of the Roschian mode of thought as a significant break from the entire category-based way of thinking.  Given that Rosch studied Wittgenstein as a student, I'm surprised her stance wasn't more extreme, more along the anti-category line of thought.  I don't want to undermine her contribution to psychology and computer science in any way, and I want to be clear that she should only be lauded for her remarkable research.  Perhaps Wittgenstein was as extreme and iconoclastic as I like my philosophers to be, but Rosch provided us with a computational theory and not just a philosophical lecture.

From my limited expertise in theories of categorization in the field of Psychology, whether it is Prototype Models or the more recent data-driven Exemplar Models, these theories are still theories of categories.  Whether the similarity computations are between prototypes and stimuli, or between exemplars and stimuli, the output of a categorization model is still a category.  Weinberger is all about modern data-driven notions of knowledge organization, in a way that breaks free from the imprisoning notion of a category.  Knowledge is power, so why imprison it in rigid modules called categories?  Below is a toy visualization of a web of concepts, as imagined by me.  This is very much the web-based view of the world.  Wikipedia is a bunch of pages and links.

Artistic rendition of a "web of concepts"

I found it valuable to think of the Visual Memex, the model I'm developing in my thesis research, as an anti-categorization model of knowledge -- a vast network of object-object relationships.  The idea of using little concrete bits of information to create a rich non-parametric web is the recurring theme in Weinberger's book.  In my case, the problem of extracting primitives from images, and all of the problems of dealing with real-world images, are there to plague me, and the Visual Memex must rely on many Computer Vision techniques -- such things are not discussed in Weinberger's book.  The "perception" or "segmentation" component of the Visual Memex is not trivial, whereas linking words on the web is much easier.

CVPR paper to look out for

However, the category-based view is all around us.  I expect most of this year's CVPR papers to fit in this category-based view of the world. One paper, co-authored by the great Torralba, looks relevant to my interests.  It is yet another triumph for the category-based mentality in computer vision.  In fact, one of the figures in the paper demonstrates the category-based view of the world very well.  Unlike the memex, the organization is explicit in the following figure:





Myung Jin Choi, Joseph Lim, Antonio Torralba, and Alan S. Willsky. CVPR 2010.

Monday, April 05, 2010

Ontology is Overrated: Categories, Links, and Tags


This is the title of a powerful treatise written by Clay Shirky, in which he strives to "convince you that a lot of what we think we know about categorization is wrong."  Much thanks to David Weinberger's blog www.everythingismiscellaneous.com for pointing out this article.  The take-home message is quite similar to some of the "Beyond Categories" ideas I've tried to promulgate in my meager attempt to understand why progress in computer vision has reached a standstill.  For anybody interested in understanding the limitations of classical systems of categorization, this article is worth a read.

Tuesday, November 24, 2009

Understanding the role of categories in object recognition

If we set aside our academic endeavors of publishing computer vision and machine learning papers and sincerely ask ourselves, "What is the purpose of recognition?" a very different story emerges.

Let me first outline the contemporary stance on recognition (that is, object recognition as embraced by the computer vision community), which is actually a bit of a "non-stance" because many people working on recognition haven't bothered to understand the motivations, implications, and philosophical foundations of their work. The standard view of recognition is that it is equivalent to categorization -- assigning an object its "correct" category is the goal of recognition. Object recognition, as found in vision papers, is commonly presented as a single-image recognition task which is not tied to an active and mobile agent that must understand and act in the environment around it. These contrived tasks are partially to blame for making us think that categories are the ultimate truth. Of course, once we've pinpointed the correct category we can look up information about the object category at hand in some sort of grand encyclopedia. For example, once we've categorized an object as a bird, we can simply recall the fact that "it flies" from such a source of knowledge.

Most object recognition research is concerned with object representations (what features to compute from an image) as well as supervised (and semi-supervised) machine learning techniques to learn object models from data in order to discriminate and thus "recognize" object categories. The reason why object recognition has become so popular in the recent decade is that many researchers in AI/Robotics envision a successful vision system as a key component in any real-world robotic platform. If you ask humans to describe their environment, they will probably use a bunch of nouns to enumerate the stuff around them, so surely nouns must be the basic building blocks of reality! In this post I want to question this commonsense assumption that categories are the building blocks of reality and propose a different way of coping with reality, one that doesn't try to directly estimate a category from visual data.

I argue that just because nouns (and the categories they refer to) are the basis of effability for humans, it doesn't mean that nouns and categories are the quarks and gluons of recognition. Language is a relatively recent phenomenon for humans (think evolutionary scale here), and it is absent in many animals inhabiting the earth beside us. It is absurd to think that animals do not possess a faculty for recognition just because they do not have a language. Since animals can quite effectively cope with the world around them, there must be hope for understanding recognition in a way that doesn't invoke linguistic concepts.

Let me make my first disclaimer. I am not against categories altogether -- they have their place. The goal of language is human-human communication and intelligent robotic agents will inevitably have to map their internal modes of representation onto human language if we are to understand and deal with such artificial beings. I just want to criticize the idea that categories are found deep within our (human) neural architecture and serve as the basis for recognition.


Imagine a caveman and his daily life which requires quite a bit of "recognition"-abilities to cope with the world around him. He must differentiate pernicious animals from edible ones, distinguish contentious cavefolk from his peaceful companions, and reason about the plethora of plants around him. For each object that he recognizes, he must be able to determine whether it is edible, dangerous, poisonous, tasty, heavy, warm, etc. In short, recognition amounts to predicting a set of attributes associated with an object. Recognition is the linking of perceptible attributes (it is green and the size of my fist) to our past experiences and predicting attributes that are not conveyed by mere appearance. If we see a tiger, it is solely on our past experiences that we can call it dangerous.

So imagine a vector space, where each dimension encodes an attribute such as edible, throwable, tasty, poisonous, kind, etc. Each object can be represented as a point in this attribute space. It is language that gives us categories as a shorthand to talk about commonly found objects. Different cultures would give rise to different ways of cutting up the world, and this is consistent with what psychologists have observed. Viewing categories as a way of compressing attribute vectors not only makes sense but is in agreement with the idea that categories arose culturally much later than the human ability to recognize objects. Thus it makes sense to think of category-free recognition. Since a robotic agent that was programmed to think of the world in terms of categories will have to unroll those categories into tangible properties if it is to make sense of the world around it, why not use the properties/attributes as the primary elements of recognition in the first place!?
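A minimal sketch of this attribute-space view (every attribute name and object below is hypothetical, invented purely for illustration): recognition becomes filling in the unobserved dimensions of an attribute vector from similar past experiences, and no category label is ever computed along the way:

```python
import numpy as np

# Hypothetical attribute space (all names invented for illustration):
# each remembered object is a point whose coordinates are attributes.
attributes = ["edible", "dangerous", "throwable", "furry", "green"]
memory = {
    "apple":       np.array([1, 0, 1, 0, 1], float),
    "tiger":       np.array([0, 1, 0, 1, 0], float),
    "tennis_ball": np.array([0, 0, 1, 1, 1], float),
    "berry":       np.array([1, 0, 1, 0, 0], float),
}

def predict_attributes(visible, memory):
    """Category-free recognition: given the perceptible attributes (NaN
    marks the unknown dimensions), predict the hidden ones by softly
    weighting past experiences by their similarity on the visible dims."""
    known = ~np.isnan(visible)
    weights = {name: np.exp(-np.linalg.norm(v[known] - visible[known]))
               for name, v in memory.items()}
    total = sum(weights.values())
    return sum(wt * memory[name] for name, wt in weights.items()) / total

# "It is green, not furry, and throwable" -- is it edible?  dangerous?
seen = np.array([np.nan, np.nan, 1, 0, 1])
pred = predict_attributes(seen, memory)
print(dict(zip(attributes, pred.round(2))))   # edible scores high, dangerous low
```

The point of the toy is only that "edible" and "dangerous" come out as predictions grounded in past experience, without a "category" node ever mediating the inference.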



These ideas are not entirely new. In Computer Vision, there is a CVPR 2009 paper Describing objects by their attributes by Farhadi, Endres, Hoiem, and Forsyth (from UIUC) which strives to understand objects directly using the ideas discussed above. In the domain of thought recognition, the paper Zero-Shot Learning with Semantic Output Codes by Palatucci, Pomerleau, Hinton, and Mitchell strives to understand concepts in a similar semantic basis.

I believe the field of computer vision has been conceptually stuck and the vehement reliance on rigid object categories is partially to blame. We should read more Wittgenstein and focus more on understanding vision as a mere component of artificial intelligence. If we play the recognize objects in a static image game (as Computer Vision is doing!) then we obtain a fragmented view of reality and cannot fully understand the relationship between recognition and intelligence.

Thursday, November 05, 2009

The Visual Memex: Visual Object Recognition Without Categories


Figure 1
I have discussed the limitations of using rigid object categories in computer vision, and my CVPR 2008 work on Recognition as Association was a move towards developing a category-free model of objects. I was primarily concerned with local object recognition, where the recognition problem was driven by the appearance/shape/texture features derived from within a segment (a region extracted from an image using an image segmentation algorithm). Recognition of objects was done locally and independently per region, since I did not have a good model of category-free context at that time. I've given the problem of contextual object reasoning much thought over the past several years, and equipped with the power of graphical models and learning algorithms, I now present a model for category-free object relationship reasoning.

Now it's 2009, and it's no surprise that I have a paper on context. Context is the new beast, and all the cool kids are using it for scene understanding; however, categories are used so often for this problem that their use is rarely questioned. In my NIPS 2009 paper, I present a category-free model of object relationships and address the problem of context-only recognition, where the goal is to recognize an object solely based on contextual cues. Figure 1 shows an example of such a prediction task. Given K objects and their spatial configuration, is it possible to predict the appearance of a hidden object at some spatial location?

Figure 2


I present a model called the Visual Memex (visualized in Figure 2), which is a non-parametric graph-based model of visual concepts and their interactions. Unlike traditional approaches to object-object modeling which learn potentials between every pair of categories (the number of such pairs scales quadratically with the number of categories), I make no category assumptions for context.
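To give a flavor of category-free context, here is a tiny toy of my own (not the paper's learned model, which also encodes local appearance and 2D spatial context): exemplars are graph nodes, co-occurrence forms context edges, and a hidden object is predicted purely from the exemplars observed around it:

```python
# Toy Memex-style sketch (my own illustration, not the paper's model):
# nodes are individual exemplars; context edges link exemplars that
# co-occurred in a scene.  No node carries a category label.
scenes = [(0, 3, 7), (0, 7, 9), (1, 4, 7), (2, 5, 8), (3, 7, 11)]
context = {frozenset((i, j)) for s in scenes
           for i in s for j in s if i != j}

def predict_hidden(observed, num_nodes=12):
    """Context-only recognition: score every candidate exemplar by how
    many context edges tie it to the exemplars already observed."""
    scores = {c: sum(frozenset((c, o)) in context for o in observed)
              for c in range(num_nodes) if c not in observed}
    return max(scores, key=scores.get)

# Having observed exemplars 0 and 3, exemplar 7 is the best guess for
# the hidden object: it co-occurred with both.
print(predict_hidden({0, 3}))   # -> 7
```

Note that the number of edges here grows with the data, not with the square of a category vocabulary — which is the structural point of going category-free.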

The official paper is out, and can be found on my project page:

Tomasz Malisiewicz, Alexei A. Efros. Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships. In NIPS, December 2009. PDF

Abstract: The use of context is critical for scene understanding in computer vision, where the recognition of an object is driven by both local appearance and the object's relationship to other elements of the scene (context). Most current approaches rely on modeling the relationships between object categories as a source of context. In this paper we seek to move beyond categories to provide a richer appearance-based model of context. We present an exemplar-based model of objects and their relationships, the Visual Memex, that encodes both local appearance and 2D spatial context between object instances. We evaluate our model on Torralba's proposed Context Challenge against a baseline category-based system. Our experiments suggest that moving beyond categories for context modeling appears to be quite beneficial, and may be the critical missing ingredient in scene understanding systems.

I gave a talk about my work yesterday at CMU's Misc-read and received some good feedback. I'll be at NIPS this December representing this body of research.

Monday, October 26, 2009

Wittgenstein's Critique of Abstract Concepts

In his Philosophical Investigations, Wittgenstein argues against abstraction -- via several thought experiments he strives to annihilate the view that during their lives humans develop neat and consistent concepts in their minds (akin to building a dictionary). He criticizes the commonplace notions of meaning and concept formation (as were commonly used in philosophical circles at the time) and has contributed greatly to my own ideas regarding categorization in computer vision.

Wittgenstein asks the reader to come up with the definition of the concept "game." While we can look up the definition of "game" in a dictionary, we can't help but feel that any definition will be either too narrow or too broad. The number of exceptions we would need in a single definition scales as the number of unique games we've been exposed to. His point wasn't that game cannot be defined -- it was that the lack of a formal definition does not prevent us from using the word "game" correctly. Think of a child growing up and being exposed to multi-player games, single-player games, fun games, competitive games, games that are primarily characterized by their display of athleticism (aka sports or Olympic Games). Let's not forget activities such as courting and the Stock Market which are also referred to as "games." Wittgenstein criticizes the idea that during our lives we somehow determine what is common between all of those examples of games and form an abstract concept of game which determines how we categorize novel activities. For Wittgenstein, our concept of game is not much more than our exposure to activities labeled as games and our ability to re-apply the word game in future context.

Wittgenstein's ideas are an antithesis to Platonic Realism and Aristotle's Classical notion of Categories, where concepts/categories are pure, well-defined, and possess neatly defined boundaries. For Wittgenstein, experience is the anchor which allows us to measure the similarity between a novel activity and past activities referred to as games. Maybe the ineffability of experience isn't because internal concepts are inaccessible to introspection, maybe there is simply no internal library of concepts in the first place.

An experience-based view of concepts (or as my advisor would say, a data-driven theory of concepts) suggests that there is no surrogate for living a life rich with experience. While this has implications for how one should live their own life, it also has implications in the field of artificial intelligence. The modern enterprise of "internet vision" where images are labeled with categories and fed into a classifier has to be questioned. While I have criticized categories, there are also problems with a purely data-driven large-database-based approach. It seems that a good place to start is by pruning away redundant bits of information; however, judging what is redundant and how is still an open question.

Friday, June 19, 2009

A Shift of Focus: Relying on Prototypes versus Support Vectors

The goal of today's blog post is to outline an important difference between traditional categorization models in Psychology such as Prototype Models, and Support Vector Machine (SVM) based models.


When solving an SVM optimization problem in the dual (given a kernel function), the answer is represented as a set of weights associated with each of the data-centered kernels. In the Figure above, an SVM is used to learn a decision boundary between the blue class (desks) and the red class (chairs). The sparsity of such solutions means that only a small set of examples is used to define the class decision boundary. All points on the wrong side of the decision boundary, as well as correctly classified points that fall within the margin, have non-zero weights. Many Machine Learning researchers get excited about the sparsity of such solutions because, in theory, we only need to remember a small number of training examples (the support vectors) at test time. However, the decision boundary is defined with respect to the problematic examples (misclassified and barely classified ones) and not the most typical examples. The most typical (and easiest to recognize) examples are not even necessary to define the SVM decision boundary. Two data sets that share the same problematic examples, but differ significantly in their "well-classified" examples, might result in the exact same SVM decision boundary.
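The dependence of the solution on margin-violating points alone is easy to see in a toy sketch. The following is a minimal illustration (not a real SVM solver) with invented 1-D data and a hand-picked linear decision function: only the points that are misclassified or inside the margin are candidates for non-zero dual weight.

```python
# Toy sketch: which training points can carry non-zero dual weight in an SVM.
# The data, weight w, and bias b are invented for illustration; no solver is run.
w, b = 1.0, 0.0  # a hand-picked linear decision function f(x) = w*x + b

# (feature, label) pairs; labels are +1 (chair) or -1 (desk)
data = [(-3.0, -1), (-1.5, -1), (-0.5, -1), (0.4, 1), (1.2, 1), (3.0, 1)]

# Functional margin y*f(x): <= 1 means the point is misclassified or inside
# the margin -- these are the candidate support vectors.
support = [(x, y) for (x, y) in data if y * (w * x + b) <= 1.0]
easy    = [(x, y) for (x, y) in data if y * (w * x + b) > 1.0]  # zero dual weight
```

Moving or even deleting any point in `easy` would leave the boundary unchanged, which is exactly the lost "density" information complained about here.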

My problem with such boundary-based approaches is that by focusing only on the boundary between classes, useful information is lost. Consider what happens when two points are correctly classified and fall well beyond the margin on their correct side: the distance to the decision boundary is no longer a good measure of class membership. By failing to capture the "density" of the data, the sparsity of such models can actually be a bad thing. As with discriminative methods in general, reasoning about the support vectors is useful for close-call classification decisions, but we lose fine-scale membership details (aka "density information") far from the decision surface.


In a single-prototype model (pictured above), a single prototype is used per class, and distances to the prototypes implicitly define the decision surface. The focus is on exactly the "most confident" examples, namely the prototypes. Prototypes are created during training -- if we fit a Gaussian distribution to each class, the mean becomes the prototype. Notice that by focusing on prototypes, we gain density information near the prototype at the cost of losing fine details near the decision boundary. Single-prototype models generally perform worse on forced-choice classification tasks than their SVM-based discriminative counterparts; however, there are important regimes where too much emphasis on the decision boundary is a bad thing.
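As a sketch of the single-prototype idea: fit a mean per class, classify by nearest prototype, and let distance to the prototype double as a graded typicality score. The class names and 2-D features below are invented for illustration.

```python
import math

# Nearest-prototype (nearest-mean) classifier sketch with made-up 2-D features.
classes = {
    "chair": [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)],
    "desk":  [(4.0, 4.2), (3.8, 4.0), (4.1, 3.9)],
}

# The prototype is the per-class mean (the fitted-Gaussian view from the text).
prototypes = {
    c: tuple(sum(v) / len(pts) for v in zip(*pts)) for c, pts in classes.items()
}

def classify(x):
    """Assign x to the class of the nearest prototype."""
    return min(prototypes, key=lambda c: math.dist(x, prototypes[c]))

def typicality(x, c):
    """Graded membership: smaller distance to the prototype = more typical."""
    return -math.dist(x, prototypes[c])
```

Note the trade-off in code form: `typicality` gives a meaningful score everywhere in feature space, but the implied decision surface (equidistance between means) is much cruder than an SVM boundary.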

In other words, Prototype Methods are best at what they were designed to do in categorization, namely capturing Typicality Effects (see Rosch). It would be interesting to come up with more applications where handling Typicality Effects and graded membership becomes more important than making close-call classification decisions. I suspect that in many real-world information retrieval applications (where high precision is required and low recall is tolerated), going beyond boundary-based techniques is the right thing to do.

Friday, June 12, 2009

Exemplars, Prototypes, and towards a Theory of Concepts for AI

While initial musings (and some early theories) on Categorization come from Philosophy (think Categories by Aristotle), most modern research on Categorization which adheres to the scientific method comes from Psychology (Concept Learning on Wikipedia). Two popular models which originate from Psychology literature are Prototype Theory and Exemplar Theory. Summarizing briefly, categories in Prototype Theory are abstractions which summarize a category while categories in Exemplar Theory are represented nonparametrically. While I'm personally a big proponent of Exemplar Theory (see my Recognition by Association CVPR2008 paper), I'm not going to discuss the details of my philosophical stance in this post. I want to briefly point out the shortcomings of these two simplified views of concepts.

Researchers focusing on Categorization are generally dealing with a very simplified (and overly academic) view of the world -- where the task is to categorize a single input stimulus. The problem is that if we want a Theory of Concepts that will be the backbone of intelligent agents, we have to deal with relationships between concepts with as much fervor as the representations of concepts themselves. While the debate concerning exemplars vs. prototypes has been restricted to these single-stimulus categorization experiments, it is not clear to me why we should prematurely adhere to one of these polarized views before we consider how we can make sense of inter-category relationships. In other words, if an exemplar-based view of concepts looks good (so far) yet is not as useful for modeling relationships as a prototype-based view, then we have to change our views. Following James' pragmatic method, we should evaluate category representations with respect to a larger system embodied in an intelligent agent (and its ability to cope with the world) and not the overly academic single-stimulus experiments dominating experimental psychology.

On another note, I submitted my most recent research to NIPS last week (supersecret for now), and went to a few Phish concerts. I'm driving to California next week and I start at Google at the end of June. I also started reading a book on James and Wittgenstein.


Thursday, March 26, 2009

Beyond Categorization: Getting Away From Object Categories in Computer Vision

Natural language evolved over thousands of years to become the powerful tool that it is today. When we say things using language to convey our experiences with the world, we can't help but refer to object categories. When we say something such as "this is a car," what we are actually saying is "this is an instance from the car category." Categories let us get away from referring to individual object instances -- in most cases, knowing that something belongs to a particular category is more than enough knowledge to deal with it. This is a type of "understanding by compression," or understanding by abstracting away the unnecessary details. In the words of Rosch, "the task of category systems is to provide maximum information with the least cognitive effort." Rosch would probably agree that it only makes sense to talk about the utility of a category system (as a tool for getting a grip on reality) as opposed to the truth value of a category system with respect to how well it aligns with an observer-independent reality. The degree of pragmatism expressed by Rosch is something that William James would have been proud of.

From a very young age we are taught language, and soon it takes over our inner world. We 'think' in language. Language provides us with a list of nouns -- a way of cutting up the world into categories. Different cultures have different languages that cut up the world differently, and one might wonder how well the object categories contained in any single language correspond to reality -- if it even makes sense to talk about an observer-independent reality. Rosch would argue that human categorization is the result of "psychological principles of categorization" and is more related to how we interact with the world than to how the world is. If the only substances we ingested for nutrients were types of grass, then categorizing all of the different strains of grass with respect to flavor, vitamin content, color, etc. would be beneficial for us (as a species). Rosch points out in her works that her ideas refer to categorization at the species level, and she calls it human categorization. She is not referring to a personal categorization -- for example, the way a child might cluster concepts when he/she starts learning about the world.

It is not at all clear to me whether we should be using the categories from natural language as the to-be-recognized entities in our image understanding systems. Many animals do not have a language with which they can compress percepts into neat little tokens -- yet they have no problem interacting with the world. Of course, if we want to build machines that understand the world around them in a way that they can communicate with us (humans), then language and its inherent categorization will play a crucial role.

While we ultimately use language to convey our ideas to other humans, how early are the principles of categorization applied to perception? Is the grouping of percepts into categories even essential for perception? I doubt that anybody would argue that language and its inherent categorization is not useful for dealing with the world -- the only question is how it interacts with perception.

Most computer vision researchers are stuck in the world of categorization, and many systems rely on categorization at a very early stage. A problem with categorization is its inability to deal with novel categories -- something which humans must deal with from a very young age. We (humans) can often deal with arbitrary input and, using analogies, can still get a grip on the world around us (even when it is full of novel categories). One hypothesis is that at the level of visual perception, things do not get recognized as discrete object classes but are placed in a continuous recognition space. Thus, instead of asking the question "What is this?" we focus on similarity measurements and ask "What is this like?". Such a comparison-based view would help us cope with novel concepts.
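The "What is this like?" view can be sketched as ranking stored exemplars by similarity instead of forcing a hard category decision. Everything below (the exemplar labels and their feature vectors) is invented for illustration:

```python
import math

# Hypothetical exemplar memory: (label, feature-vector) pairs.
exemplars = [
    ("car",   (0.9, 0.1)),
    ("truck", (0.8, 0.3)),
    ("horse", (0.1, 0.9)),
]

def what_is_this_like(query, k=2):
    """Return the k most similar stored exemplars, nearest first,
    rather than a single hard category label."""
    ranked = sorted(exemplars, key=lambda e: math.dist(query, e[1]))
    return [name for name, _ in ranked[:k]]
```

A genuinely novel input still produces a useful answer here -- a ranked list of familiar things it resembles -- whereas a forced-choice classifier must shoehorn it into one of its known classes.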