Thursday, December 10, 2009

Computer Vision Papers at NIPS 2009

Here are some computer vision papers I found interesting at NIPS 2009 in Vancouver.

[pdf][bib] Unsupervised Detection of Regions of Interest Using Iterative Link Analysis (NIPS 2009)
Gunhee Kim, Antonio Torralba

[pdf][bib] Region-based Segmentation and Object Detection (NIPS 2009)
Stephen Gould, Tianshi Gao, Daphne Koller

[pdf][bib] Segmenting Scenes by Matching Image Composites (NIPS 2009)
Bryan Russell, Alyosha Efros, Josef Sivic, Bill Freeman, Andrew Zisserman

[pdf][bib] Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships (NIPS 2009)
Tomasz Malisiewicz, Alyosha Efros

Sunday, December 06, 2009

On Science: passion towards solving a problem must come from within

I'm currently in Chicago, en route to Vancouver (for the NIPS 2009 conference), where I'll be defending my research during Tuesday's poster session. Instead of delving into the computational challenges that motivate my research, I want to take a step back and criticize what (sometimes? often?) happens during the publishing cycle.

In my view, good research starts with the passion to solve a particular problem or address a specific concern. Quite often, good research raises more questions than it successfully answers. Unfortunately, when we submit papers to conferences we are judged on clarity of presentation, level of experimental validation, and overall completeness. This means the publishing cycle often promotes writing "cute" papers that have little long-term impact on the field and can only be viewed as thorough and complete because of their narrow scope. This is why we should not rely solely on peer review, nor cater our scientific lives towards pleasing others. Sometimes being a good scientist means breaking free from the norms that the world around us rigidly follows; sometimes publishing too often skews our research focus; and sometimes falling off the face of the earth for a period of time is necessary to push science in a new direction.

I want to challenge every scientist to follow their dreams and attempt to solve the problems they truly care about, and not just attempt to please peer review. Maybe some think that the perversion of science (that is, evaluating scientists by the number of publications they have) is okay, but in my book a scientific career which produces a single grand idea is superior to a career saturated with myriad "cute and thorough" papers. I'm not particularly upset with the progress of Computer Vision, but I think more people should ponder the negative consequences of pulling the publish-trigger too often.

Tuesday, November 24, 2009

Understanding the role of categories in object recognition

If we set aside our academic endeavors of publishing computer vision and machine learning papers and sincerely ask ourselves, "What is the purpose of recognition?" a very different story emerges.

Let me first outline the contemporary stance on recognition (that is, object recognition as embraced by the computer vision community), which is actually a bit of a "non-stance" because many people working on recognition haven't bothered to understand the motivations, implications, and philosophical foundations of their work. The standard view of recognition is that it is equivalent to categorization -- assigning an object its "correct" category is the goal of recognition. Object recognition, as found in vision papers, is commonly presented as a single-image recognition task which is not tied to an active and mobile agent that must understand and act in the environment around it. These contrived tasks are partially to blame for making us think that categories are the ultimate truth. Of course, once we've pinpointed the correct category we can look up information about the object category at hand in some sort of grand encyclopedia. For example, once we've categorized an object as a bird we can simply recall the fact that "it flies" from such a source of knowledge.

Most object recognition research is concerned with object representations (what features to compute from an image) as well as supervised (and semi-supervised) machine learning techniques to learn object models from data in order to discriminate and thus "recognize" object categories. The reason why object recognition has become so popular in the past decade is that many researchers in AI/Robotics envision a successful vision system as a key component in any real-world robotic platform. If you ask humans to describe their environment, they will probably use a bunch of nouns to enumerate the stuff around them, so surely nouns must be the basic building blocks of reality! In this post I want to question this commonsense assumption that categories are the building blocks of reality and propose a different way of coping with reality, one that doesn't try to directly estimate a category from visual data.

I argue that just because nouns (and the categories they refer to) are the basis of effability for humans, it doesn't mean that nouns and categories are the quarks and gluons of recognition. Language is a relatively recent phenomenon for humans (think evolutionary scale here), and it is absent in many animals inhabiting the earth beside us. It is absurd to think that animals do not possess a faculty for recognition just because they do not have a language. Since animals can quite effectively cope with the world around them, there must be hope for understanding recognition in a way that doesn't invoke linguistic concepts.

Let me make my first disclaimer. I am not against categories altogether -- they have their place. The goal of language is human-human communication and intelligent robotic agents will inevitably have to map their internal modes of representation onto human language if we are to understand and deal with such artificial beings. I just want to criticize the idea that categories are found deep within our (human) neural architecture and serve as the basis for recognition.


Imagine a caveman and his daily life, which requires quite a bit of recognition ability to cope with the world around him. He must differentiate pernicious animals from edible ones, distinguish contentious cavefolk from his peaceful companions, and reason about the plethora of plants around him. For each object that he recognizes, he must be able to determine whether it is edible, dangerous, poisonous, tasty, heavy, warm, etc. In short, recognition amounts to predicting a set of attributes associated with an object. Recognition is the linking of perceptible attributes (it is green and the size of my fist) to our past experiences, and the prediction of attributes that are not conveyed by mere appearance. If we see a tiger, it is solely on the basis of our past experiences that we can call it dangerous.

So imagine a vector space, where each dimension encodes an attribute such as edible, throwable, tasty, poisonous, kind, etc. Each object can be represented as a point in this attribute space. It is language that gives us categories as a shorthand to talk about commonly found objects. Different cultures would give rise to different ways of cutting up the world, and this is consistent with what has been observed by psychologists. Viewing categories as a way of compressing attribute vectors not only makes sense but is in agreement with the idea that categories culturally arose much later than the ability for humans to recognize objects. Thus it makes sense to think of category-free recognition. Since a robotic agent that was programmed to think of the world in terms of categories will have to unroll those categories into tangible properties if it is to make sense of the world around it, why not use the properties/attributes as the primary elements of recognition in the first place!?
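To make the attribute-space picture concrete, here is a toy MATLAB sketch (every object, attribute, and number below is made up purely for illustration):

attributes = {'edible', 'throwable', 'tasty', 'poisonous', 'kind'};
A = [0.9  0.1  0.8  0.0  0.5
     0.1  0.9  0.0  0.1  0.2];    % rows: a fruit-like memory, a rock-like memory
% a new percept only reveals its perceptible attributes (say the first two);
% recognition = predicting the hidden attributes from the nearest memory
x = [0.8 0.2];
d = sum(bsxfun(@minus, A(:,1:2), x).^2, 2);
[~, nn] = min(d);
predicted = A(nn, 3:end);         % tasty? poisonous? kind?

No category label ever enters the computation -- past experiences vote directly in attribute space.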



These ideas are not entirely new. In Computer Vision, there is a CVPR 2009 paper, Describing objects by their attributes by Farhadi, Endres, Hoiem, and Forsyth (from UIUC), which strives to understand objects directly using the ideas discussed above. In the domain of thought recognition, the paper Zero-Shot Learning with Semantic Output Codes by Palatucci, Pomerleau, Hinton, and Mitchell strives to understand concepts using a similar semantic basis.

I believe the field of computer vision has been conceptually stuck, and the vehement reliance on rigid object categories is partially to blame. We should read more Wittgenstein and focus more on understanding vision as a mere component of artificial intelligence. If we play the "recognize objects in a static image" game (as Computer Vision is doing!) then we obtain a fragmented view of reality and cannot fully understand the relationship between recognition and intelligence.

Thursday, November 12, 2009

Learning and Inference in Vision: from Features to Scene Understanding


Tomorrow, Jonathan Huang and I are giving a Computer Vision tutorial at the First MLD (Machine Learning Department) Research Symposium at CMU. The title of our presentation is Learning and Inference in Vision: from Features to Scene Understanding.

The goal of the tutorial is to expose Machine Learning students to state-of-the-art object recognition, scene understanding and the inference problems associated with such high-level recognition problems. Our target audience is graduate students with little or no prior exposure to object recognition who would like to learn more about the use of probabilistic graphical models in Computer Vision. We outline the difficulties present in object recognition/detection and outline several different models for jointly reasoning about multiple object hypotheses.

Saturday, November 07, 2009

A model of thought: The Associative Indexing of the Memex

The Memex "Memory Extender" is an organizational device, a conceptual device, and a framework for dealing with conceptual relationships in an associative way. Abandoning the Aristotelian tradition of rooting concepts in definitions, the Memex suggests an association-based, non-parametric, and data-driven representation of concepts.

Since the mind=software analogy is so deeply engraved in my thoughts, it is hard for me to see intelligent reasoning as anything but a computer program (albeit one which we might never discover/develop). It is worthwhile to see sketches of the Memex from an era before computers (see the figure below). However, with the modern Internet -- a magnificent embodiment of Bush's ideology, with links denoting the associations between pages -- we need no better analogy. Bush's critique of the artificiality of traditional schemes of indexing resonates in the world wide web.


A Mechanical Memex Sketch

By extrapolating Bush's anti-indexing argument to visual object recognition, I realize that the blunder is to assign concepts to rigid categories. The desire to break free from categorization was the chief motivation for my Visual Memex paper. If Bush's ideas were so successful in predicting the modern Internet, we should ask ourselves, "Why are categories so prevalent in computational models of perception?" Maybe it is machine learning, with its own tradition of classes in supervised learning approaches, that has scarred the way we computer scientists see reality.

“The human mind does not work that way. It operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate web of trails carried by the cells of the brain. It has other characteristics, of course; trails that are not frequently followed are prone to fade, items are not fully permanent, memory is transitory. Yet the speed of action, the intricacy of trails, the detail of mental pictures, is awe-inspiring beyond all else in nature.” -- Vannevar Bush

Is Google's grasp of the world of information anything more than a Memex? I doubt that it is. While the feat of searching billions of web pages in real time has already been demonstrated by Google (and is reinforced every day), the best computer vision approaches as of today resemble nothing like Google's data-driven way of representing concepts. I'm quite interested in pushing this link-based, data-driven mentality to the next level in the field of Computer Vision. Breaking free from the categorization assumptions that plague computational perception might be the key ingredient in the recipe for success.

Instead of summarizing, here is another link to a well-written article on the Memex by Finn Brunton. Quoting Brunton, "The deepest implications of the Memex would begin to become apparent here: not the speed of retrieval, or even the association as such, but the fact that the association is arbitrary and can be shared, which begins to suggest that, at some level, the data itself is also arbitrary within the context of the Memex; that it may not be “the shape of thought,” emphasis on the the, but that it is the shape of a new thought, a mediated and mechanized thought, one that is described by queries and above all by links."

Thursday, November 05, 2009

The Visual Memex: Visual Object Recognition Without Categories


Figure 1
I have discussed the limitations of using rigid object categories in computer vision, and my CVPR 2008 work on Recognition as Association was a move towards developing a category-free model of objects. I was primarily concerned with local object recognition, where the recognition problem was driven by the appearance/shape/texture features derived from within a segment (a region extracted from an image using an image segmentation algorithm). Recognition of objects was done locally and independently per region, since I did not have a good model of category-free context at that time. I've given the problem of contextual object reasoning much thought over the past several years, and equipped with the power of graphical models and learning algorithms I now present a model for category-free object relationship reasoning.

Now it's 2009, and it's no surprise that I have a paper on context. Context is the new beast and all the cool kids are using it for scene understanding; however, categories are used so often for this problem that their use is rarely questioned. In my NIPS 2009 paper, I present a category-free model of object relationships and address the problem of context-only recognition, where the goal is to recognize an object solely based on contextual cues. Figure 1 shows an example of such a prediction task. Given K objects and their spatial configuration, is it possible to predict the appearance of a hidden object at some spatial location?

Figure 2


I present a model called the Visual Memex (visualized in Figure 2), which is a non-parametric graph-based model of visual concepts and their interactions. Unlike traditional approaches to object-object modeling which learn potentials between every pair of categories (the number of such pairs scales quadratically with the number of categories), I make no category assumptions for context.
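To give a flavor of what a non-parametric graph over exemplars looks like, here is a toy MATLAB sketch (purely illustrative -- this is not the construction used in the paper); F is an assumed matrix of per-exemplar appearance features:

% F: N x D matrix, one row of appearance features per object exemplar
Fn = bsxfun(@rdivide, F, sqrt(sum(F.^2, 2)));  % unit-normalize each exemplar
S = Fn * Fn';                                  % cosine similarity, N x N
W = S .* (S > 0.8);                            % keep only strong similarity edges
W(1:size(W,1)+1:end) = 0;                      % remove self-edges
% W is the weighted adjacency matrix of the exemplar graph: structure
% comes from the links themselves, not from any category labels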

The official paper is out, and can be found on my project page:

Tomasz Malisiewicz, Alexei A. Efros. Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships. In NIPS, December 2009. PDF

Abstract: The use of context is critical for scene understanding in computer vision, where the recognition of an object is driven by both local appearance and the object's relationship to other elements of the scene (context). Most current approaches rely on modeling the relationships between object categories as a source of context. In this paper we seek to move beyond categories to provide a richer appearance-based model of context. We present an exemplar-based model of objects and their relationships, the Visual Memex, that encodes both local appearance and 2D spatial context between object instances. We evaluate our model on Torralba's proposed Context Challenge against a baseline category-based system. Our experiments suggest that moving beyond categories for context modeling appears to be quite beneficial, and may be the critical missing ingredient in scene understanding systems.

I gave a talk about my work yesterday at CMU's Misc-read and received some good feedback. I'll be at NIPS this December representing this body of research.

Monday, October 26, 2009

Wittgenstein's Critique of Abstract Concepts

In his Philosophical Investigations, Wittgenstein argues against abstraction -- via several thought experiments he strives to annihilate the view that during their lives humans develop neat and consistent concepts in their minds (akin to building a dictionary). He criticizes the commonplace notions of meaning and concept formation (as were commonly used in philosophical circles at the time) and has contributed greatly to my own ideas regarding categorization in computer vision.

Wittgenstein asks the reader to come up with the definition of the concept "game." While we can look up the definition of "game" in a dictionary, we can't help but feel that any definition will be either too narrow or too broad. The number of exceptions we would need in a single definition scales with the number of unique games we've been exposed to. His point wasn't that "game" cannot be defined -- it was that the lack of a formal definition does not prevent us from using the word "game" correctly. Think of a child growing up and being exposed to multi-player games, single-player games, fun games, competitive games, games that are primarily characterized by their display of athleticism (aka sports or Olympic Games). Let's not forget activities such as courting and the Stock Market which are also referred to as "games." Wittgenstein criticizes the idea that during our lives we somehow determine what is common between all of those examples of games and form an abstract concept of game which determines how we categorize novel activities. For Wittgenstein, our concept of game is not much more than our exposure to activities labeled as games and our ability to re-apply the word in future contexts.

Wittgenstein's ideas are an antithesis to Platonic Realism and Aristotle's Classical notion of Categories, where concepts/categories are pure, well-defined, and possess neatly defined boundaries. For Wittgenstein, experience is the anchor which allows us to measure the similarity between a novel activity and past activities referred to as games. Maybe the ineffability of experience isn't because internal concepts are inaccessible to introspection, maybe there is simply no internal library of concepts in the first place.

An experience-based view of concepts (or as my advisor would say, a data-driven theory of concepts) suggests that there is no surrogate for living a life rich with experience. While this has implications for how one should live their own life, it also has implications in the field of artificial intelligence. The modern enterprise of "internet vision" where images are labeled with categories and fed into a classifier has to be questioned. While I have criticized categories, there are also problems with a purely data-driven large-database-based approach. It seems that a good place to start is by pruning away redundant bits of information; however, judging what is redundant and how is still an open question.

Monday, October 19, 2009

Scene Prototype Models for Indoor Image Recognition

In today's post I want to briefly discuss a computer vision paper which has caught my attention.

In the paper Recognizing Indoor Scenes, Quattoni and Torralba build a scene recognition system for categorizing indoor images. Instead of performing learning directly in descriptor space (such as the GIST over the entire image), the authors use a "distance-space" representation. An image is described by a vector of distances to a large number of scene prototypes. A scene prototype consists of a root feature (the global GIST) as well as features belonging to a small number of regions associated with the prototype. One example of such a prototype might be an office scene with a monitor region in the center of the image and a keyboard region below it -- however the ROIs (which can be thought of as parts of the scene) are often more abstract and do not neatly correspond to a single object.
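Here is a minimal sketch of the distance-space idea as I read it (variable names are my own, not the authors' code):

% gist_im: 1 x D global GIST of the input image (assumed given)
% gist_protos: P x D matrix of prototype GISTs (assumed given)
d = sqrt(sum(bsxfun(@minus, gist_protos, gist_im).^2, 2));  % P x 1 distances
f = exp(-d.^2 / (2*sigma^2));   % the image's distance-space representation
% classification then operates on f, via learned per-class weights over
% prototypes, instead of directly on the raw descriptor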


The learning problem (which is solved once per category) is then to find the internal parameters of each prototype as well as the per-class prototype distance weights which are used for classification. From a distance function learning point of view, it is rather interesting to see distances to many exemplars being used as opposed to the distance to a single focal exemplar.

Although the authors report results on the image categorization task, it is worthwhile to ask if scene prototypes could be used for object localization. While it is easy to be the evil genius and devise an image that is unique enough that it doesn't conform to any notion of a prototype, I wouldn't be surprised if 80% of the images we encounter on the internet conform to a few hundred scene prototypes. Of course the problem of learning such prototypes from data without prototype-labeling (which requires expert vision knowledge) is still open. Overall, I like the direction and ideas contained in this research paper and I'm looking forward to seeing how these ideas develop.

Tuesday, October 13, 2009

What is segmentation-driven object recognition?

In this post, I want to discuss what the term "segmentation-driven object recognition" means to me. While segmentation-only and object recognition-only research papers are ubiquitous in vision conferences (such as CVPR, ICCV, and ECCV), a new research direction which uses segmentation for recognition has emerged. Many researchers pushing in this direction are direct descendants of the great J. Malik, such as Belongie, Efros, Mori, and many others. The best example of segmentation-driven recognition can be found in Rabinovich's Objects in Context paper. The basic idea in this paper is to compute multiple stable segmentations of an input image using Ncuts and use a dense probabilistic graphical model over segments (combining local terms and segment-segment context) to recognize objects inside those regions.

Segmentation-only research focuses on the actual image segmentation algorithms -- where the output of a segmentation algorithm is a partition of a 2D image into contiguous regions. Algorithms such as mean-shift, normalized cuts, as well as 100s of probabilistic graphical models can be used to produce such segmentations. The Berkeley group (in an attempt to salvage "mid-level" vision) has been working diligently on boundary detection and image segmentation for over a decade.

Recognition-only research generally focuses on new learning techniques or building systems to perform well on detection/classification benchmarks. The sliding window approach coupled with bag-of-words models has dominated vision and is the unofficial method of choice.

It is easy to relax the bag-of-words model, so let's focus on rectangles for a second. If we do not use segmentation, the world of objects will have to conform to sliding rectangles and image parsing will inevitably look like this:

(Taken from Bryan Russell's Object Recognition by Scene Alignment paper).

It has been argued that segmentation is required to move beyond the world of rectangular windows if we are to successfully break up images into their constituent objects. While some objects can be neatly approximated by a rectangle in the 2D image plane, to explain away an arbitrary image free-form regions must be used. I have argued this point extensively in my BMVC 2007 paper, and the interesting result was that multiple segmentations must be used if we want to produce reasonable segments. Sadly, segmentation is generally not good enough by itself to produce object-corresponding regions.



(Here is an example of the Mean Shift algorithm where to get a single cow segment two adjacent regions had to be merged.)

The question of how to use segmentation algorithms for recognition is still open. If segmentation could tessellate an image into "good" regions in one shot, then the goal of recognition would simply be to label these regions and life would become simple. This is unfortunately far from reality. While blobs of homogeneous appearance often correspond to things like sky, grass, and road, many objects do not pop out as a single segment. I have proposed using a soup of such segments that come from different algorithms run with different parameters (and even merging pairs and triplets of such segments!), but this produces a large number of regions, which makes the recognition task harder.
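Schematically, building such a soup might look like this (segment_image is a hypothetical stand-in for running mean-shift or Ncuts at a given parameter setting, and im is the input image):

soup = {};
for k = [2 4 8 16]                    % sweep the segmentation parameter
  regions = segment_image(im, k);     % hypothetical: cell array of region masks
  soup = [soup, regions];
end
% also merge pairs of adjacent regions, since objects rarely pop out whole
n = numel(soup);
for a = 1:n
  for b = (a+1):n
    if any(any(imdilate(soup{a}, ones(3)) & soup{b}))   % adjacent regions?
      soup{end+1} = soup{a} | soup{b};                  % merged pair joins the soup
    end
  end
end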

Using a soup of segments, a small fraction of the regions might be of high quality; however, recognition now has to throw away 1000s of misleading segments. Abhinav Gupta, a new addition to the CMU vision community, has pointed out that if we want to model context between segments (and for object-object relationships this means a quadratic dependence on the number of segments), using a large soup of segments is simply not tractable. Either the number of segments or the number of context interactions has to be reduced in this case, but non-quadratic object-object context models are an open question.

In conclusion, the representation used by segmentation (that of free-form regions) is superior to sliding window approaches which utilize rectangular windows. However, off-the-shelf segmentation algorithms are still lacking with respect to their ability to generate such regions. Why should an algorithm that doesn't know anything about objects be able to segment out objects? I suspect that in the upcoming years we will see a flurry of learning-based segmenters that provide a blend of recognition and bottom-up grouping, and I envision such algorithms being used in a strictly non-feedforward way.

Saturday, September 19, 2009

joint regularization slides

Trevor Darrell posted his slides from BAVM about joint regularization across classifier learning. I think this is a really cool and promising idea and I plan on applying it to my own research on local distance function learning when I get back to CMU in October.

The idea is that there should be significant overlap between what a cat classifier learns and what a dog classifier learns. So why learn the two classifiers independently?

My paper on the Visual Memex got accepted to NIPS 2009
so I will be there representing my work in December. Be sure to read future blog posts about this work which strives to break free from using categories in Computer Vision.

On another note, today was my last day interning at Google (a former Robograd was my mentor) and I will be driving back to Pittsburgh from Mountain View this Sunday. Yosemite is the first stop! I plan on doing some light hiking with my new Vibram Five Fingers! I've been using them for deadlifting and they've been great for both working out and just chilling/coding around the Googleplex.


Tuesday, August 18, 2009

exciting stuff at BAVM2009 #1: joint regularization

There were a couple of cool computer vision ideas that I was exposed to at BAVM2009. First, Trevor Darrell mentioned some cool work by Ariadna Quattoni on L1/L_inf regularization. The basic idea, which has also recently been used in other ICML 2009 works such as Han Liu and Mark Palatucci's Blockwise Coordinate Descent, is that you want to regularize across a bunch of problems. This is sometimes referred to as multi-task learning. Imagine solving two SVM optimization problems to find linear classifiers for detecting cars and bicycles in images. It is reasonable to expect that in high dimensional spaces these two classifiers will have something in common. To provide more intuition, it might be the case that your feature set provides many irrelevant variables, and when learning these classifiers independently much work is spent on removing these dumb variables. By doing some sort of joint regularization (or joint feature selection), you can share information across seemingly distinct classification problems.
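To make this concrete, here is the flavor of the L1/L_inf penalty as I understand it (a sketch, not Quattoni's exact formulation):

% W is D x K: one column of linear-classifier weights per task
penalty = sum(max(abs(W), [], 2));   % L1 across features of the L_inf across tasks
% each feature is paid for once no matter how many tasks use it,
% so the tasks are encouraged to share one common sparse set of features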

In fact, when I was talking about my own CVPR08 work Daphne Koller suggested that this sort of regularization might work for my task of learning distance functions. However, I am currently exploiting the independence that I get from not doing any cross-problem regularization by solving the distance function learning problems independently. While regularization might be desirable, it couples problems and it might be difficult to solve hundreds of thousands of such problems jointly.

I will mention some other cool works in future posts.

Friday, August 14, 2009

Bay Area Vision Meeting (BAVM 2009): Image and Video Understanding

Tomorrow (Friday) afternoon is BAVM 2009, a Bay Area workshop on Image and Video Understanding, which will be held at Stanford this year. It is being organized by Juan Carlos Niebles, one of Fei-Fei Li's students, and I will be there representing CMU. I have a poster about some new research and getting feedback is always good, but I'm really excited about meeting some of the other graduate students who work on image understanding. The Berkeley group has been pushing hard on segmentation-driven image understanding, so seeing what they're up to should be interesting. There will also be many fellow Googlers and researchers from companies in the Bay Area, so it will also be a good place to network.


I look forward to hearing the invited speakers and seeing the bleeding-edge stuff during the poster sessions. I'll try to blog a little bit about some of the coolest stuff I encounter when I get back.

Friday, August 07, 2009

Graphviz for Object Recognition Research

Many of the techniques that I employ for object recognition utilize a non-parametric representation of visual concepts. In many such non-parametric models, examples of visual concepts are stored in a database as opposed to "abstracted away" as is commonly done when fitting a parametric appearance model. When designing such non-parametric models, I find it important to visualize the relationships between concepts. The ability to visualize what you're working on creates an intimate link between you and your ideas and can often drive creativity.

One way to visualize a database of exemplar objects, or a "soup of concepts," is as a graph. This generally makes sense when it is meaningful to define an edge between two atoms. While a vector-drawing utility (such as Illustrator) is great for manually putting together graphs for presentations or papers, automated visualization of large graphs is critical for debugging many graph-based algorithms.

A really cool (and secret) figure which I generated using Graphviz somewhat recently can be seen below. I use Matlab to write a simple .dot file and then call something like neato to get the pdf output. Click on the image to see the vectorized pdf automatically produced by Graphviz.

Graphviz generated graph
What does this graph show? It's a secret... (details coming soon)
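For the curious, the Matlab-to-Graphviz pipeline is only a few lines (the file and variable names below are made up for this example):

fid = fopen('concepts.dot', 'w');
fprintf(fid, 'graph concepts {\n');
for a = 1:N
  for b = (a+1):N
    if W(a,b) > 0                      % W is the graph's adjacency matrix
      fprintf(fid, '  n%d -- n%d;\n', a, b);
    end
  end
end
fprintf(fid, '}\n');
fclose(fid);
system('neato -Tpdf concepts.dot -o concepts.pdf');   % let Graphviz do the layout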

Friday, July 31, 2009

Simple Newton's Method Fractal code in MATLAB

Due to popular request I'm sharing some very simple Newton's Method fractal code in MATLAB. It produces the following 800x800 image (in about 2.5 seconds on my 2.4GHz Macbook Pro):


>> [niters,solutions] = matlab_fractal;
>> imagesc(niters)


function [niters,solutions] = matlab_fractal
%Create Newton's Method Fractal Image
%Tomasz Malisiewicz (tomasz@cmu.edu)
%http://quantombone.blogspot.com/
NITER = 40;
threshold = .001;

[xs,ys] = meshgrid(linspace(-1,1,800), linspace(-1,1,800));
solutions = xs(:) + 1i*ys(:);
select = 1:numel(xs);
niters = NITER*ones(numel(xs), 1);

for iteration = 1:NITER
  oldi = solutions(select);

  %in newton's method we have z_{i+1} = z_i - f(z_i) / f'(z_i)
  solutions(select) = oldi - f(oldi) ./ fprime(oldi);

  %check for convergence or NaN (division by zero)
  differ = (oldi - solutions(select));
  converged = abs(differ) < threshold;
  problematic = isnan(differ);

  niters(select(converged)) = iteration;
  niters(select(problematic)) = NITER+1;
  select(converged | problematic) = [];
end

niters = reshape(niters,size(xs));
solutions = reshape(solutions,size(xs));

function res = f(x)
%f(z) = z^3 - 1, whose roots are the three cube roots of unity
res = x.^3 - 1;

function res = fprime(x)
res = 3*x.^2;


Wednesday, July 15, 2009

Spin Images for object recognition in 3D Laser Data

Today's post is about 3D object recognition, that is, the localization and recognition of objects in 3D laser data (and not the perception/recovery of 3D structure from 2D images).

My first exposure to object recognition was in the context of specific object recognition inside 3D laser scans. In specific object recognition, you are looking for 'stapler X' or 'computer keyboard Y' and not just any stapler/computer keyboard. If the computer keyboard was black then it will always be black since we assume intrinsic appearance doesn't change in specific object recognition. This is a different (and easier!) problem than category-based recognition where colors and shapes can change due to intra-class variation.

The problem of specific object 3D recognition I'll be discussing is as follows:

Given M detailed 3D object models, localize all (if any) of these objects (in any spatial configuration) in a 3D laser scan of a scene potentially containing much more stuff than just the objects of interest (aka the clutter).

There was actually quite a lot of research in this style of 3D recognition in the 1990's, with the belief that 3D recognition would be much simpler than recognition from 2D images. The idea (Marr's idea, actually) was that object recognition in 2D images would be preceded by object-identity-independent 3D surface extraction, so that 2D recognition would resemble this version of 3D recognition after some initial geometric processing.

However, it turns out that many of the ambiguities present in 2D imagery are also present in 3D laser data -- the problems of bottom-up perceptual grouping are as difficult in 3D as in 2D. Just because you have 3D locations associated with parts of an object does not make it any easier to tell where the object begins and where it ends (namely, the problem of segmentation). It is this inability to segment out objects that resulted in the widespread usage of local descriptors such as SIFT.

Many of today's 2D object recognition problems rely on local descriptors which bypass the problem of segmentation, and it isn't surprising that the 3D recognition problem I described above was elegantly approached by A.E. Johnson and M. Hebert as early as 1997 via a local 3D descriptor known as a Spin Image.



The idea behind a Spin Image is actually very similar to that of a SIFT descriptor used in image-based object recognition. A spin image is a regional point descriptor used to characterize the shape properties of a 3D object with respect to a single oriented point. It is called a "spin" image because the process of creating such a descriptor can be envisioned as spinning a sheet around the axis defined by an oriented point and collecting the contributions of nearby points. Since a point's normal can be computed fairly robustly given its neighboring points, the spin image is highly robust to rigid transformations when defined with respect to this canonical frame. Since it is 2D and not 3D it does lose some discriminative power -- two different yet related surface chunks can have the same spin image. The idea behind using this descriptor for recognition is that we can compute many of these descriptors all over the surface of our object models as well as the input 3D laser scan. We then have to perform matching over these descriptors to create some sort of correspondences (potentially spatially verified).
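For concreteness, here is a bare-bones sketch of the spin image construction described above (my own sketch, not Johnson and Hebert's code); the surface points P, the basis point p, and its unit normal n are assumed inputs:

% P: M x 3 surface points, p: 1 x 3 oriented basis point, n: 1 x 3 unit normal
d = bsxfun(@minus, P, p);                       % offsets from the basis point
beta = d * n';                                  % signed height along the normal
alpha = sqrt(max(0, sum(d.^2, 2) - beta.^2));   % radius from the normal axis
% bin the (alpha, beta) pairs into a 2D histogram: that is the spin image
nbins = 16;
amax = max(alpha) + eps;
bmax = max(abs(beta)) + eps;
ai = min(nbins, 1 + floor(nbins * alpha / amax));
bi = min(nbins, 1 + floor(nbins * (beta + bmax) / (2*bmax)));
spin = accumarray([ai bi], 1, [nbins nbins]);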

(For a fairly recent overview of spin images as well as other similar regional shape descriptors and their applications to 3D object recognition check out Andrea Frome's ECCV 2004 paper, Recognizing Objects in Range Data Using Regional Point Descriptors.)

Spin images aren't a thing of the past, in fact here is a link to a RSS 2009 paper by Kevin Lai and Dieter Fox which uses spin images (and my local distance function learning approach!):
3D Laser Scan Classification Using Web Data and Domain Adaptation

Friday, July 03, 2009

Linguistic Idealism

I have been an anti-realist since my freshman year of college. Due to my lack of philosophical vocabulary I might have even called myself an idealist back then. However, looking back I think it would have been much better to use the word 'anti-realist.' I was mainly opposed to the correspondence theory of truth, which presupposes an external, observer-independent reality to which our thoughts and notions are supposed to adhere. It was in the context of the Philosophy of Science that I acquired my strong anti-realist views (developing them while taking Quantum Mechanics, Epistemology, and Artificial Intelligence courses at the same time). Pragmatism -- the offspring of William James -- was the view which best summarized my philosophical position. While pragmatism is a rejection of absolutes, an abandonment of metaphysics, it does not get in the way of making progress in science. It is merely a new perspective on science, a view that does not undermine the creativity of the creator of scientific theories, a re-rendering of the scientist as more of an artist and less of a machine.

However, pragmatism is not the anything-goes postmodern philosophy that many believe it to be. It is as if there is something about the world which compels scientists to do science in a similar way and for ideas to converge. I recently came across the concept of Linguistic Idealism, and being a recent reader of Wittgenstein this is a truly novel concept for me. Linguistic Idealism is a sort of dependence on language, or the Gamest-of-all-games that we (humans) play. It is a sort of epiphany that all statements we make about the world are statements within the customs of language which results in a criticism of the validity of those statements with respect to correspondence to an external reality. The criticism of statements' validity stems from the fact that they rely on language, a somewhat arbitrary set of customs and rules which we follow when we communicate. Philosophers such as Sellars have gone as far as to say that all awareness is linguistically mediated. If we step back, can we say anything at all about perception?

I'm currently reading a book on Wittgenstein called "Wittgenstein's Copernican Revolution: The Question of Linguistic Idealism."

Monday, June 29, 2009

It's all about the data

I'm at Google this summer (Google summer internship round #2) because it's where the data is. If you want to recognize objects from images you need to learn what objects look like. If you want to learn what an object looks like you need to have many examples of that object. You then feed those instances into an algorithm to figure out its essence -- what it is about that object's appearance that makes it that object. Google has the data and Google has the infrastructure to process that data, so I'm there for the summer.

Friday, June 19, 2009

A Shift of Focus: Relying on Prototypes versus Support Vectors

The goal of today's blog post is to outline an important difference between traditional categorization models in Psychology such as Prototype Models, and Support Vector Machine (SVM) based models.


When solving an SVM optimization problem in the dual (given a kernel function), the answer is represented as a set of weights associated with each of the data-centered kernels. In the Figure above, an SVM is used to learn a decision boundary between the blue class (desks) and the red class (chairs). The sparsity of such solutions means that only a small set of examples is used to define the class decision boundary. All points on the wrong side of the decision boundary, as well as barely yet correctly classified points (within the margin), have non-zero weights. Many Machine Learning researchers get excited about the sparsity of such solutions because, in theory, we only need to remember a small number of kernels at test time. However, the decision boundary is defined with respect to the problematic examples (misclassified and barely classified ones) and not the most typical examples. The most typical (and easy to recognize) examples are not even necessary to define the SVM decision boundary. Two data sets that have the same problematic examples, but significant differences in the "well-classified" examples, might result in the exact same SVM decision boundary.
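To see why the boundary focus falls out of the math, recall the form of the learned decision function -- a standard kernel expansion, sketched here in MATLAB with assumed variable names (X, y, alphas, b, sigma):

% X: N x D training data, y: N x 1 labels in {-1,+1},
% alphas: N x 1 learned dual weights, b: bias, sigma: RBF bandwidth
k = @(x) exp(-sum(bsxfun(@minus, X, x).^2, 2) / (2*sigma^2));
f = @(x) sum(alphas .* y .* k(x)) + b;   % decision value for a test point x
sv_fraction = mean(alphas > 1e-6);       % most alphas are exactly zero

Only the examples with non-zero alpha (the support vectors) contribute anything to f; every well-classified example beyond the margin is simply invisible to the decision rule.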

My problem with such boundary-based approaches is that by focusing only on the boundary between classes useful information is lost. Consider what happens when two points are correctly classified (and fall well beyond the margin on their correct side): the distance-to-decision-boundary is not a good measure of class membership. By failing to capture the "density" of data, the sparsity of such models can actually be a bad thing. As with discriminative methods, reasoning about the support vectors is useful for close-call classification decisions, but we lose fine-scale membership details (aka "density information") far from the decision surface.


In a single-prototype model (pictured above), a single prototype is used per class and distances-to-prototypes implicitly define the decision surface. The focus is on exactly the 'most confident' examples, which are the prototypes. Prototypes are created during training -- if we fit a Gaussian distribution to each class, the mean becomes the prototype. Notice that by focusing on Prototypes, we gain density information near the prototype at the cost of losing fine-details near the decision boundary. Single-Prototype models generally perform worse on forced-choice classification tasks when compared to their SVM-based discriminative counterparts; however, there are important regimes where too much emphasis on the decision boundary is a bad thing.
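For contrast with the kernel expansion above, here is how little machinery a single-prototype model needs (a minimal sketch; fitting a Gaussian with identity covariance per class reduces to computing class means):

% Xtrain: N x D features, ytrain: N x 1 class indices, x: 1 x D test point
C = max(ytrain);
protos = zeros(C, size(Xtrain, 2));
for c = 1:C
  protos(c,:) = mean(Xtrain(ytrain == c, :), 1);   % the class prototype
end
d = sum(bsxfun(@minus, protos, x).^2, 2);
[~, label] = min(d);          % nearest prototype wins the forced choice
typicality = exp(-d);         % distance to prototype grades membership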

In other words, Prototype Methods are best at what they were designed to do in categorization, namely capturing Typicality Effects (see Rosch). It would be interesting to come up with more applications where handling Typicality Effects and grading membership becomes more important than making close-call classification decisions. I suspect that in many real-world information retrieval applications (where high precision is required and low recall tolerated) going beyond boundary-based techniques is the right thing to do.

Tuesday, June 16, 2009

On Edelman's "On what it means to see"

I previously mentioned Shimon Edelman in my blog and why his ideas are important for the advancement of computer vision. Today I want to post a review of a powerful and potentially influential 2009 piece written by Edelman.

Below is a review of the June 16th, 2009 version of this paper:
Shimon Edelman, On what it means to see, and what we can do about it, in Object Categorization: Computer and Human Vision Perspectives, S. Dickinson, A. Leonardis, B. Schiele, and M. J. Tarr, eds. (Cambridge University Press, 2009, in press). Penultimate draft.

I will refer to the article as OWMS (On What it Means to See).

The goal of Edelman's article is to demonstrate the limitations of conceptual vision (referred to as "seeing as"), criticize the modern computer vision paradigm as being overly conceptual, and show how providing a richer representation of a scene is required for advancing computer vision.

Edelman proposes non-conceptual vision, where categorization isn't forced on an input -- "because the input may best be left altogether uninterpreted in the traditional sense." (OWMS) I have to agree with the author: abstracting away the image into a conceptual map not only gives an impoverished view of the world, but it is also unclear whether such a limited representation is useful for other tasks relying on vision (something like the bottom of Figure 1.2 in OWMS, or the Figure seen below, taken from my Recognition by Association talk).


Building a Conceptual Map = Abstracting Away





Drawing on insights from the influential Philosopher Wittgenstein, Edelman discusses the difference between "seeing" versus "seeing as." "Seeing as" is the easy-to-formalize map-pixels-to-objects attitude which modern computer vision students are spoon fed from the first day of graduate school -- and precisely the attitude which Edelman attacks in this wonderful article. To explain "seeing" Edelman uses some nice prose from Wittgenstein's Philosophical Investigations; however, instead of repeating the passages Edelman selected, I will complement the discussion with a relevant passage by William James:

The germinal question concerning things brought for the first time before consciousness is not the theoretic "What is that?" but the practical "Who goes there?" or rather, as Horwicz has admirably put it, "What is to be done?" ... In all our discussions about the intelligence of lower animals the only test we use is that of their acting as if for a purpose. (William James in Principles of Psychology, page 941)

"Seeing as" is a non-invertible process that abstracts away visual information to produce a lower dimensional conceptual map (see Figure above), whereas "seeing" provides a richer representation of the input scene. Its not exactly clear what is the best way to operationalize this "seeing" notion in a computer vision system, but the escapability-from-formalization might be one of the subtle points Edelman is trying to make about non-conceptual vision. Quoting Edelman, when "seeing" we are "letting the seething mass of categorization processes that in any purposive visual system vie for the privilege of interpreting the input be the representation of the scene, without allowing any one of them to gain the upper hand." (OWMS) Edelman goes on to criticize "seeing as" because vision systems have to be open-ended in the sense that we cannot specify ahead of time all the tasks that vision will be applied to. According to Edelman, conceptual vision cannot capture the ineffability (or richness) of the human visual experience. Linguistic concepts capture a mere subset of visual experience, and casting the goal of vision as providing a linguistic (or conceptual) interpretation is limited. The sparsity of conceptual understanding is one key limitation of the modern computer vision paradigm. Edelman also criticizes the notion of a "ground-truth" segmentation in computer vision, arguing that a fragmentation of the scene into useful chunks is in the eye of the beholder.

To summarize, Edelman points out that "The missing component is the capacity for having rich visual experiences... The visual world is always more complex than can be expressed in terms of a fixed set of concepts, most of which, moreover, only ever exist in the imagination of the beholder." (OWMS) As a pragmatist, I find that many of these words resonate deeply within my soul, and I'm particularly attracted to elements of Edelman's antirealism.

I have to give two thumbs up to this article for pointing out the flaws in the current way computer vision scientists go about tackling vision problems (in other words, researchers too often blindly work inside the current computer vision paradigm and do not question often enough the fundamental assumptions which can help new paradigms arise). I have already pointed out many similar concerns regarding Computer Vision on this blog, and it is reassuring to find others pointing to similar paradigmatic weaknesses. Such insights need to somehow leave the Philosophy/Psychology literature and make a long-lasting impact in the CVPR/NIPS/ICCV/ECCV/ICML communities. The problem is that too many researchers/hackers actually building vision systems and teaching Computer Vision courses have no clue who Wittgenstein is, nor that they can gain invaluable insights from Philosophy and Psychology alike. Computer Vision is simply not lacking computational methods; it is lacking critical insights that cannot be found inside an Emacs buffer. In order to advance the field, one needs to read, write, philosophize, as well as mathematize, exercise, diversify, be a hacker, be a speaker, be one with the terminal, be one with prose, be a teacher, always a student, a master of all trades; or simply put, be a Computer Vision Jedi.

Friday, June 12, 2009

Exemplars, Prototypes, and towards a Theory of Concepts for AI

While initial musings (and some early theories) on Categorization come from Philosophy (think Categories by Aristotle), most modern research on Categorization which adheres to the scientific method comes from Psychology (Concept Learning on Wikipedia). Two popular models which originate from Psychology literature are Prototype Theory and Exemplar Theory. Summarizing briefly, categories in Prototype Theory are abstractions which summarize a category while categories in Exemplar Theory are represented nonparametrically. While I'm personally a big proponent of Exemplar Theory (see my Recognition by Association CVPR2008 paper), I'm not going to discuss the details of my philosophical stance in this post. I want to briefly point out the shortcomings of these two simplified views of concepts.

Researchers focusing on Categorization are generally dealing with a very simplified (and overly academic) view of the world -- where the task is to categorize a single input stimulus. The problem is that if we want a Theory of Concepts that will be the backbone of intelligent agents, we have to deal with relationships between concepts with as much fervor as the representations of concepts themselves. While the debate concerning exemplars vs. prototypes has been restricted to these single-stimulus categorization experiments, it is not clear to me why we should prematurely adhere to one of these polarized views before we consider how we can make sense of inter-category relationships. In other words, if an exemplar-based view of concepts looks good (so far) yet is not as useful for modeling relationships as a prototype-view, then we have to change our views. Following James' pragmatic method, we should evaluate category representations with respect to a larger system embodied in an intelligent agent (and its ability to cope with the world) and not the overly academic single-stimulus experiments dominating experimental psychology.

On another note, I submitted my most recent research to NIPS last week (supersecret for now), and went to a few Phish concerts. I'm driving to California next week and I start at Google at the end of June. I also started reading a book on James and Wittgenstein.


Monday, March 30, 2009

Time Travel, Perception, and Mind-wandering

Today's post is dedicated to ideas promulgated by Bar's most recent article, "The proactive brain: memory for predictions."

Bar builds on the foundation of his former thesis, namely that the brain's 'default' mode of operation is to daydream, fantasize, and continuously revisit and reshape past memories and experiences. While it makes sense that traversing the internal network of past experiences is useful when trying to understand a complex novel phenomenon, why exert so much effort when just 'chilling out,' a.k.a. being in the 'default' mode? Bar's proposal is that this seemingly wasteful daydreaming is actually crucial for generating virtual experiences and synthesizing not-directly-experienced, yet critically useful, memories of alternate scenarios. These 'alternate future memories' are how our brain recombines tidbits from actual experiences and helps us understand novel scenarios before they actually happen. It makes sense that the brain has a method for 'densifying' the network of past experiences, but that this happens in the 'default' mode is a truly bold view held by Bar.

In the domain of visual perception and scene understanding, the world has much regularity. Thus the predictions generated by our brain often match the percept, and these accurate predictions rid us of the need to exert mental brainpower on certain predictable aspects of the world. For example, seeing a bunch of cars on a road along with a bunch of windows on a building pre-sensitizes us so much with respect to seeing a stop sign in an intimate spatial relationship with the other objects that we don't need to perceive much more than a speckle of red for a nanosecond to confirm its presence in the scene.

Quoting Bar, "we are rarely in the 'now'" since when understanding the visual world we integrate information from multiple points in time. We use the information perceptible to our senses (the now), memories of former experiences (the past), as well all of the recombined and synthesized scenarios explored by our brains and encoded as virtual memories (plausible futures). In each moment of our waking life, our brains provide us with a shortlist of primed (to be expected) objects, contexts, and their configurations related to our immediate perceptible future. Who says we can't travel through time? -- it seems we are already living a few seconds ahead of direct perception (the immediate now).

Sunday, March 29, 2009

My 2nd Summer Internship in Google's Computer Vision Research Group

This summer I will be going for my 2nd summer internship at Google's Computer Vision Research Group in Mountain View, CA. My first real internship ever was last summer at Google -- I loved it.

There are many reasons for going back for the summer. Being in the research group and getting to address the same types of vision/recognition related problems as during my PhD is very important for me. It is not just a typical software engineering internship -- I get a better overall picture of how object recognition research can impact the world at a large scale, the Google-scale, before I finish my PhD and become set in my ways. Being in an environment where one can develop something super cool and weeks later millions of people see a difference in the way they interact with the internet (via Google's services of course) is also super exciting. Finally, the computing infrastructure that Google has set up for its researchers/engineers is unrivaled when it comes to large scale machine learning.

Many Google researchers (such as Fernando Pereira) are big advocates of the data-driven mentality, where using massive amounts of data coupled with simple algorithms has more promise than complex algorithms with small amounts of training data. In earlier posts I already mentioned how my advisor at CMU is a big advocate of this approach in Computer Vision. This Unreasonable Effectiveness of Data is a powerful mentality yet difficult to embrace with the computational resources offered by one's computer science department. But this data-driven paradigm is not only viable at Google -- it is the essence of Google.

Thursday, March 26, 2009

Beyond Categorization: Getting Away From Object Categories in Computer Vision

Natural language evolved over thousands of years to become the powerful tool that it is today. When we say things using language to convey our experiences with the world, we can't help but refer to object categories. When we say things such as "this is a car" what we are actually saying is "this is an instance from the car category." Categories let us get away from referring to individual object instances -- in most cases knowing that something belongs to a particular category is more than enough knowledge to deal with it. This is a type of "understanding by compression," or understanding by abstracting away the unnecessary details. In the words of Rosch, "the task of category systems is to provide maximum information with the least cognitive effort." Rosch would probably agree that it only makes sense to talk about the utility of a category system (as a tool for getting a grip on reality) as opposed to the truth value of a category system with respect to how well it aligns with observer-independent reality. The degree of pragmatism expressed by Rosch is something that William James would have been proud of.

From a very young age we are taught language and soon it takes over our inner world. We 'think' in language. Language provides us with a list of nouns -- a way of cutting up the world into categories. Different cultures have different languages that cut up the world differently and one might wonder how well the object categories contained in any given single language correspond to reality -- if it even makes sense to talk about an observer independent reality. Rosch would argue that human categorization is the result of "psychological principles of categorization" and is more related to how we interact with the world than how the world is. If the only substances we ingested for nutrients were types of grass, then categorizing all of the different strains of grass with respect to flavor, vitamin content, color, etc would be beneficial for us (as a species). Rosch points out in her works that her ideas refer to categorization at the species-level and she calls it human categorization. She is not referring to a personal categorization; for example, the way a child might cluster concepts when he/she starts learning about the world.

It is not at all clear to me whether we should be using the categories from natural language as the to-be-recognized entities in our image understanding systems. Many animals do not have a language with which they can compress percepts into neat little tokens -- yet they have no problem interacting with the world. Of course, if we want to build machines that understand the world around them in a way that they can communicate with us (humans), then language and its inherent categorization will play a crucial role.

While we ultimately use language to convey our ideas to other humans, how early are the principles of categorization applied to perception? Is the grouping of percepts into categories even essential for perception? I doubt that anybody would argue that language and its inherent categorization is not useful for dealing with the world -- the only question is how it interacts with perception.

Most computer vision researchers are stuck in the world of categorization, and many systems rely on categorization at a very early stage. A problem with categorization is its inability to deal with novel categories -- something which humans must handle from a very young age. We (humans) can often deal with arbitrary input and, using analogies, still get a grip on the world around us (even when it is full of novel categories). One hypothesis is that at the level of visual perception things do not get recognized into discrete object classes, but into a continuous recognition space. Thus instead of asking the question, "What is this?" we focus on similarity measurements and ask, "What is this like?" Such a comparison-based view would help us cope with novel concepts.
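As a toy illustration of the comparison-based view (my own sketch; the features are stand-ins for whatever descriptor a real system would compute), recognition returns a ranked list of similar exemplars rather than a category label, so a novel object still gets a meaningful answer:

```python
# Sketch of "What is this like?": answer a query with its most similar
# stored exemplars instead of forcing it into one of K categories.
import numpy as np

def what_is_this_like(query_feat, exemplar_feats, k=5):
    """Return indices of the k nearest exemplars to the query features.
    A novel object still yields useful associations -- no label needed."""
    dists = np.linalg.norm(exemplar_feats - query_feat, axis=1)
    return np.argsort(dists)[:k]

# Usage with stand-in features (a real system would use image descriptors):
rng = np.random.default_rng(1)
exemplars = rng.normal(size=(5_000, 128))
print(what_is_this_like(rng.normal(size=128), exemplars))
```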

Sunday, March 22, 2009

mr. doob's experiments

Mr. Doob has some cool (albeit simple) computer vision demos using Flash. Check them out.
I should get my fractals to animate with music in Flash -- à la Mr. Doob.


Thursday, March 19, 2009

when you outgrow homework-code: a real CRF inference library to the rescue

I have recently been doing some CRF inference for an object recognition task and needed good ol' Max-Product Loopy Belief Propagation. I revived my old MATLAB-based implementation that grew out of a Probabilistic Graphical Models homework. I had vectorized the code and tested it for correctness, but would it be good enough on problems involving thousands of nodes and arities as high as 200? It was the first time I ran my own code on such large problems, and I wasn't surprised when it took several minutes for those messages to stop passing.

I tried using Talya Meltzer's MATLAB package for inference in Undirected Graphical Models. It is a bunch of MATLAB interfaces to efficient C code. Talya is Yair Weiss's PhD student (so that basically makes her an inference expert).

It was nice to check my old homework-based code against hers and see the same beliefs for a bunch of randomly generated binary planar-grid graphs. However, for medium-sized graphs her code was running in ~1 second while my homework code was taking ~30 seconds. That was a sign that I had outgrown my homework-based code. While I was sad to see my own code go, it is a sign of maturity when your research problems mandate a better and more efficient implementation of such a basic inference algorithm. Her package was easy to use, has plenty of documentation, and I would recommend it to anybody in need of CRF inference.
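For anyone who hasn't written one of these, here is a minimal sketch of what max-product loopy BP computes -- my own toy version in the log domain, not Talya's package, and nowhere near its efficiency:

```python
# Minimal max-product loopy belief propagation on a pairwise MRF,
# in the log domain for numerical stability. Toy sketch only.
import numpy as np

def max_product_bp(unary, edges, pairwise, n_iters=50):
    """unary: {node: log-potential vector}; edges: list of (i, j);
    pairwise: {(i, j): log-potential matrix indexed [x_i, x_j]}.
    Returns an (approximate) MAP label for each node."""
    msgs = {}
    neighbors = {n: [] for n in unary}
    for (i, j) in edges:
        msgs[(i, j)] = np.zeros(len(unary[j]))  # message i -> j, over x_j
        msgs[(j, i)] = np.zeros(len(unary[i]))
        neighbors[i].append(j)
        neighbors[j].append(i)
    for _ in range(n_iters):
        for (i, j) in list(msgs):
            pot = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
            # Sum incoming log-messages at i, excluding the one from j.
            incoming = unary[i] + sum(msgs[(k, i)] for k in neighbors[i] if k != j)
            new = (pot + incoming[:, None]).max(axis=0)  # maximize over x_i
            msgs[(i, j)] = new - new.max()               # normalize to avoid drift
    # Max-beliefs: local evidence plus all incoming messages; argmax labels.
    return {n: int(np.argmax(unary[n] + sum(msgs[(k, n)] for k in neighbors[n])))
            for n in unary}

# Toy usage: a 3-node binary chain whose pairwise term favors agreement.
smooth = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
labels = max_product_bp(
    unary={0: np.log([0.8, 0.2]), 1: np.log([0.5, 0.5]), 2: np.log([0.3, 0.7])},
    edges=[(0, 1), (1, 2)],
    pairwise={(0, 1): smooth, (1, 2): smooth})
print(labels)  # {0: 0, 1: 0, 2: 0}: smoothing overrides node 2's weak vote
```

On a chain (a tree), this is exact; on a loopy graph it is the usual approximation, and a serious implementation (like the C code behind Talya's package) would schedule messages far more cleverly.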

Thursday, February 12, 2009

Context is the 'glue' that binds objects in coherent scenes.

This is not my own quote. It is one of my favorites, from Moshe Bar. It comes from his paper "Visual Objects in Context."

I have recently been giving context (in the domain of scene understanding) some heavy thought.
While Bar's paper is good, the one I want to focus on goes back to the 1980s. According to Bar, the following paper (which I wish everybody would at least skim) is "a seminal study that characterizes the rules that govern a scene's structure and their influence on perception."

Biederman, I., Mezzanotte, R. J. & Rabinowitz, J. C. Scene perception: detecting and judging objects undergoing relational violations. Cogn. Psychol. 14, 143–177 (1982).

Biederman outlines five object relations. The three semantic relations (those tied to object categories) are probability, position, and familiar size. The two syntactic relations (those not operating at the object-category level) are interposition and support. According to Biederman, "these relations might constitute a sufficient set with which to characterize the organizations of a real-world scene as distinct from a display of unrelated objects." The world has structure, and characterizing this structure in terms of such rules is quite a noble effort.
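Purely as a hypothetical sketch (Biederman defined these relations for experiments with human subjects, not for code), one could encode the five relations as a checklist and count violations to score how coherent an arrangement of detected objects is:

```python
# Hypothetical encoding of Biederman's five relations as scene checks.
# The predicates are stand-ins; each takes (object, scene) and returns
# True when the relation is satisfied for that object.
SEMANTIC = ("probability", "position", "familiar_size")  # need category identity
SYNTACTIC = ("interposition", "support")                 # category-free

def count_violations(scene, checks):
    """Count, per relation, how many objects in the scene violate it.
    `checks` maps relation name -> predicate(object, scene) -> bool."""
    return {name: sum(not ok(obj, scene) for obj in scene)
            for name, ok in checks.items()}

# Toy usage: a "floating" object violates support; everything else passes.
scene = [{"name": "sofa", "supported": True},
         {"name": "fire_hydrant", "supported": False}]
checks = {name: (lambda obj, s: True) for name in SEMANTIC + SYNTACTIC}
checks["support"] = lambda obj, s: obj["supported"]
print(count_violations(scene, checks))  # support violated once; rest pass
```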

A very interesting question that Biederman addresses is the following: do humans reason about syntactic relations before semantic relations, or the other way around? On a Gibsonian (direct perception) view of the world, the processing of depth and space precedes the assignment of identity to the stuff that occupies the space around us. J.J. Gibson's view is in accordance with Marr's pipeline.

However, Biederman's study with human subjects (he is a psychologist) suggests that information about semantic relationships among objects is not accessed after semantic-free physical relationships are processed. Quoting him directly, "Instead of a 3D parse being the initial step, the pattern recognition of the contours and the access to semantic relations appear to be the primary stages" as well as "further evidence that an object's semantic relations to other objects are processed simultaneously with its own identification."

Now that I've whetted your appetite, let's bring out the glue.

P.S. Moshe Bar was a student of I. Biederman and S. Ullman (author of "Against Direct Perception").

Thursday, February 05, 2009

suns 2009 cool talks

This past Friday I went to SUNS 2009 at MIT, and in my opinion the coolest talks were by Aude Oliva, David Forsyth, and Ce Liu.

While I will not summarize their talks, which covered unpublished work, I will provide a few high-level questions that capture (according to me) the ideas conveyed by these speakers.

Aude: What is the interplay between scene-level context, local category-specific features, as well as category-independent saliency that makes us explore images in a certain way when looking for objects?

David: Is naming all the objects depicted in an image the best way to understand an image? Don't we really want some type of understanding that will allow us to reason about never before seen objects?

Ce: Can we understand the space of all images by cleverly interpolating between what we are currently perceiving and what we have seen in the past?

Wednesday, January 28, 2009

SUNS 2009: Scene Understanding Symposium at MIT

This Friday (January 30, 2009) I will be attending SUNS 2009, otherwise known as the Scene Understanding Symposium, held at MIT and organized by Aude Oliva, Thomas Serre, and Antonio Torralba. It is free, so grad students in the area should definitely go!

Quoting the SUNS 2009 homepage, "SUnS 09 features 16 speakers and about 20 poster presenters from a variety of disciplines (neurophysiology, cognitive neuroscience, visual cognition, computational neuroscience and computer vision) who will address a range of topics related to scene and natural image understanding, attention, eye movements, visual search, and navigation."

I'm looking forward to the talks by researchers such as Aude Oliva, David Forsyth, Alan Yuille, and Ted Adelson. I will try to blog about some cool stuff while I'm there.

Tuesday, January 13, 2009

Computer Vision Courses, Measurement, and Perception

The new semester began at CMU and I'm happy to announce that I'm TAing my advisor's 16-721 Learning Based Methods in Vision this semester. I'm also auditing Martial Hebert's Geometry Based Methods in Vision.

This semester we're trying to encourage students of 16-721 LBMV09 to discuss papers using a course discussion blog. Quicktopic has been used in the past, but this semester we're using Google's Blogger.com for the discussion!

In the first lecture of LBMV, we discussed the problem of Measurement versus Perception in a Computer Vision context. The idea is that while we could build vision systems to measure the external world, it is percepts such as "there is a car on the bottom of the image," and not measurements such as "the bottom of the image is gray," that we are ultimately interested in. However, the line between measurement and perception is somewhat blurry. Consider the following gedanken experiment: place a human in a box and feed him an image along with the question "is there a car on the bottom of the image?" Is it legitimate to call this apparatus a measurement device? If so, then isn't perception a type of measurement? We would still have the problem of building a second copy of this measurement device -- different people have different notions of what counts as a car, and once we start feeding the two apparatuses objects that sit on the boundary between trucks/buses/vans/cars, we would lose measurement repeatability.

This whole notion of measurement versus perception in computer vision is awfully similar to the theory-versus-observation problem in the philosophy of science. Thomas Kuhn would say that the window through which we peer (our scientific paradigm) circumscribes the world we see, and thus it is not possible to make theory-independent observations. For a long time I have been a proponent of this postmodern view of the world. The big question that remains is: for computer vision to be successful, how much consensus must there be between human perception and machine perception? If, according to Kuhn, Aristotelian and Galilean physicists would have different "observations" of the same experiment, then should we expect intelligent machines to see the same world that we see?