Tombone's Computer Vision Blog: March 2009

Monday, March 30, 2009

Time Travel, Perception, and Mind-wandering

Today's post is dedicated to ideas promulgated by Bar's most recent article, "The proactive brain: memory for predictions."

Bar builds on the foundation of his former thesis, namely that the brain's 'default' mode of operation is to daydream, fantasize, and continuously revisit and reshape past memories and experiences. While it makes sense that traversing the internal network of past experiences is useful when trying to understand a complex novel phenomenon, why exert so much work when just 'chilling out' a.k.a. being in the 'default' mode? Bar's proposal is that this seemingly wasteful daydreaming is actually crucial for generating virtual experiences and synthesizing not-directly-experienced, yet critically useful memories of alternate scenarios. These 'alternate future memories' are how our brain recombines tidbits from actual experiences and helps us understand novel scenarios before they actually happen. It makes sense that the brain has a method for 'densifying' the network of past experiences, but that this happens in the 'default' mode a truly bold view held by Bar.

In the domain of visual perception and scene understanding, the world has much regularity. Thus the predictions generated by our brain often match the percept, and thus accurate predictions rid us of the need to exert mental brainpower on certain predictable aspects of the world. For example, seeing a bunch of cars on a road along with a bunch of windows on a building pre-sensitizes us so much with respect to seeing a stop sign in an intimate spatial relationship with the other objects that we don't need to perceive much more than speckle of red for a nanosecond to confirm its presence in the scene.

Quoting Bar, "we are rarely in the 'now'" since when understanding the visual world we integrate information from multiple points in time. We use the information perceptible to our senses (the now), memories of former experiences (the past), as well all of the recombined and synthesized scenarios explored by our brains and encoded as virtual memories (plausible futures). In each moment of our waking life, our brains provide us with a shortlist of primed (to be expected) objects, contexts, and their configurations related to our immediate perceptible future. Who says we can't travel through time? -- it seems we are already living a few seconds ahead of direct perception (the immediate now).

Sunday, March 29, 2009

My 2nd Summer Internship in Google's Computer Vision Research Group

This summer I will be going for my 2nd summer internship at Google's Computer Vision Research Group in Mountain View, CA. My first real internship ever was last summer at Google -- I loved it.

There are many reasons for going back for the summer. Being in the research group and getting to address the same types of vision/recognition related problems as during my PhD is very important for me. It is not just a typical software engineering internship -- I get an better overall picture of how object recognition research can impact the world at a large scale, the Google-scale, before I finish my PhD and become set in my ways. Being in an environment where one can develop something super cool and weeks later millions of people see a difference in the way they interact with the internet (via Google's services of course) is also super exciting. Finally, the computing infrastructure that Google has set up for its researchers/engineers is unrivaled when it comes to large scale machine learning.

Many Google researchers (such as Fernando Periera) are big advocates of the data-driven mentality, where using massive amounts of data coupled with simple algorithms has more promise than complex algorithms with small amounts of training data. In earlier posts I already mentioned how my advisor at CMU is a big advocate of this approach in Computer Vision. This Unreasonable Effectiveness of Data is a powerful mentality yet difficult to embrace with the computational resources offered by one's computer science department. But this data-driven paradigm is not only viable at Google -- it is the essence of Google.

Thursday, March 26, 2009

Beyond Categorization: Getting Away From Object Categories in Computer Vision

Natural language evolved over thousands of years to become the powerful tool that is is today. When we say things using language to convey our experiences with the world, we can't help but refer to object categories. When we say things such as "this is a car" what we are actually saying is "this is an instance from the car category." Categories let us get away from referring to individual object instances -- in most cases knowing that something belongs to a particular category is more than enough knowledge to deal with it. This is a type of "understanding by compression" or understanding by abstracting away the unnecessary details. In the words of Rosch, "the task of category systems is to provide maximum information with the least cognitive effort." Rosch would probably agree that it only makes sense to talk about the utility of a category system (a for getting a grip on reality) as opposed to the truth value of a category system with respect how well it aligns to observer-independent reality. The degree of pragmatism expressed by Rosch is something that William James would have been proud of.

From a very young age we are taught language and soon it takes over our inner world. We 'think' in language. Language provides us with a list of nouns -- a way of cutting up the world into categories. Different cultures have different languages that cut up the world differently and one might wonder how well the object categories contained in any given single language correspond to reality -- if it even makes sense to talk about an observer independent reality. Rosch would argue that human categorization is the result of "psychological principles of categorization" and is more related to how we interact with the world than how the world is. If the only substances we ingested for nutrients were types of grass, then categorizing all of the different strains of grass with respect to flavor, vitamin content, color, etc would be beneficial for us (as a species). Rosch points out in her works that her ideas refer to categorization at the species-level and she calls it human categorization. She is not referring to a personal categorization; for example, the way a child might cluster concepts when he/she starts learning about the world.

It is not at all clear to me whether we should be using the categories from natural language as the to-be-recognized entities in our image understanding systems. Many animals do not have a language with which they can compress percepts into neat little tokens -- yet they have no problem interacting with the world. Of course, if we want to build machines that understand the world around them in a way that they can communicate with us (humans), then language and its inherent categorization will play a crucial role.

While we ultimately use language to convey our ideas to other humans, how early are the principles of categorization applied to perception? Is the grouping of percepts into categories even essential for perception? I doubt that anybody would argue that language and its inherent categorization is not useful for dealing with the world -- the only question is how it interacts with perception.

Most computer vision researchers are stuck in the world of categorization and many systems rely on categorization at a very early stage. A problem with categorization is its inability to deal with novel categories -- something which humans must deal with at a very young age. We (humans) can often deal with arbitrary input and using analogies can still get a grip and the world around us (even when it is full of novel categories). One hypothesis is that at the level of visual perception things do not get recognized into discrete object classes -- but a continuous recognition space. Thus instead of asking the question, "What is this?" we focus on similarity measurements and ask "What is this like?". Such a comparison-based view would help us cope with novel concepts.

Sunday, March 22, 2009

mr. doob's experiments

Mr. Doob has some cool (albeit simple) computer vision demos using Flash. Check them out.
I should get my fractals to animate with music in Flash - ala Mr. Doob.

This is debug link. http://rfrrr.com/

Thursday, March 19, 2009

when you outgrow homework-code: a real CRF inference library to the rescue

I have recently been doing some CRF inference for an object recognition task and needed a good ol' Max-Product Loopy Belief Propagation. I revived my old MATLAB-based implementation that grew out of a Probabilistic Graphical Models homework. Even though I had vectorized the code and had tested it for correctness -- would my own code be good enough on problems involving thousands of nodes and arities as high as 200? It was the first time I ran my own code on such large problems and I wasn't surprised when it took several minutes for those messages to stop passing.

I tried using Talya Meltzer's MATLAB package for inference in Undiracted Graphical Models. It is a bunch of MATLAB interfaces to efficient C code. Talya is Yair Weiss's PhD student (so that basically makes her an inference expert).

It was nice to check my old homework-based code and see the same beliefs for a bunch of randomly generated binary planar-grid graphs. However, for medium sized graphs her code was running in ~1 second while my homework code was taking ~30 seconds. That was a sign that I had outgrown my homework-based code. While I was sad to see my own code go, it is a sign of maturity when your research problems mandate a better and more-efficient implementation of such a basic inference algorithm. Her package was easy to use, has plenty of documentation, and I would recommend it to anybody in need of CRF inference.