Tombone's Computer Vision Blog: September 2011

Thursday, September 29, 2011

plenoptica theoretica: fields vs. particles

"All bodies together, and each by itself, give off to the surrounding air an infinite number of images which are all-pervading and each complete, each conveying the nature, colour and form of the body which produces it." --Leonardo da Vinci

Yesterday, Edward H. Adelson ("Ted Adelson") gave a lecture at MIT on the plenoptic function and its role in understanding (and unifying) early vision. Ted has been at MIT for quite some time. He is sometimes described as one-third human vision, one-third computer vision, and one-third computer graphics, and he was Bill Freeman's advisor.

What is the plenoptic function?

Etymology: Plenoptic comes from plenus+optic.
plenus: full, filled
optic: relating to the eye or vision

Ted Adelson imagined a sort of unified field theory for vision -- instead of proposing a jungle of atoms such as edges, corners, and peaks, the plenoptic function offers a unifying principle under which color, texture, motion, etc. can all be viewed as gradients of the plenoptic function. The plenoptic function is a complete representation which contains, implicitly, a description of every possible photograph that could be taken of a particular space-time chunk of the world. Omniscience is to knowing as the plenoptic function is to seeing.
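For reference, here is the plenoptic function written out in the parameterization Adelson and Bergen use, as far as I recall it: P gives the intensity of light of wavelength λ, seen at time t from viewing position (V_x, V_y, V_z) in direction (x, y). The paper also gives an equivalent angular form using θ and φ. The second line is only my rough paraphrase of the "gradients" idea, pairing each early-vision quantity with a derivative along one axis. It is a sketch, not a table taken from the paper, and motion in particular is better described as orientation in the joint (x, t) space than as a pure time derivative.

```latex
P = P(x, y, \lambda, t, V_x, V_y, V_z)

\underbrace{\frac{\partial P}{\partial x}, \frac{\partial P}{\partial y}}_{\text{spatial structure, texture}}
\qquad
\underbrace{\frac{\partial P}{\partial \lambda}}_{\text{color}}
\qquad
\underbrace{\frac{\partial P}{\partial t}}_{\text{temporal change, motion}}
\qquad
\underbrace{\frac{\partial P}{\partial V_x}}_{\text{horizontal parallax, stereo}}
```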

Ted remarked that if you had asked him 20 years ago what he was working on in vision, you might have gotten a confusing answer. "Do you work on texture, motion, stereo, or illumination?" you might ask. "All of them. Aren't they all the same thing?" he might reply. Ted argues that vision scientists in the 80s and early 90s tried to cut up the world of vision into neat little "particles" and developed theories around their favorite particle -- here the particles are early vision concepts such as color, texture, and motion.

In their seminal paper, The Plenoptic Function and the Elements of Early Vision, Adelson and Bergen state that "the elemental operations of early vision involve the measurement of local change along various directions within the plenoptic function." As a theoretical device, the plenoptic function has left a lasting impression on me. I first came across Ted's ideas back in 2006, thanks to Alyosha Efros' course on vision. Having just completed a BS in Physics, I was well aware of unified field theories in physics, and the plenoptic function seemed too cool to forget.
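To make "measurement of local change along various directions" concrete, here is a minimal NumPy sketch of my own (not code from the paper). It builds a hypothetical (x, t) slice of the plenoptic function containing a sinusoid translating at a known speed, measures the local derivatives along x and t with np.gradient, and recovers the velocity by least squares, assuming brightness constancy (P_t + v P_x = 0). The function name estimate_velocity and the toy signal are illustrative assumptions.

```python
import numpy as np


def estimate_velocity(P_xt):
    """Least-squares velocity from a sampled (x, t) slice of the plenoptic function.

    Assumes brightness constancy, P_t + v * P_x = 0, and unit sample spacing
    along both axes (axis 0 = x, axis 1 = t).
    """
    # Local change along x and along t (central differences in the interior,
    # one-sided differences at the borders).
    P_x, P_t = np.gradient(P_xt)
    # Minimizing sum((P_t + v * P_x) ** 2) over v gives this closed form.
    return -np.sum(P_x * P_t) / np.sum(P_x * P_x)


# Toy slice: a sinusoid (wavelength 50 samples) moving right at v_true samples per frame.
v_true = 0.5
x = np.arange(200)[:, None]   # 200 spatial samples (axis 0)
t = np.arange(20)[None, :]    # 20 frames (axis 1)
P_xt = np.sin(2 * np.pi * (x - v_true * t) / 50.0)

v_est = estimate_velocity(P_xt)
print(round(float(v_est), 2))
```

The same np.gradient call applied to an array with a wavelength or viewpoint axis would measure change along those directions too; the point of the sketch is only that motion becomes one more local derivative measurement.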

What the plenoptic function means to me
However, if the plenoptic function is the Maxwell's equations of early (low-level) vision, then what I'm ultimately after is the Schrödinger equation of late (high-level) vision. In his lecture, Ted Adelson acknowledged that vision scientists have a sort of Atom Envy -- they envy physicists, who are able to understand the world in terms of a few fundamental, ontologically meaningful entities. First of all, I like particles, but I have no a priori reason to stay in the particle camp all my life. Second, the plenoptic function was all about early vision, while my research is all about high-level vision such as object recognition. I might be young and foolish, but the search for a "mind mechanics" has been part of my research life (at least partially) since ~2003. Right now, my best guess is that exemplars and associations are the basic building blocks of high-level vision. But unlike the British Empiricists (the champions of associationism), I would argue that the atomic building blocks of associations are object instances, not ideas such as "roundness" and "blueness". Complex ideas are then the object categories which arise out of the interactions between these concrete elements of experience.

Conclusion
The Adelson and Bergen paper is a must-read for anybody serious about vision. While it might not offer much in terms of "what next" in vision research, it is nevertheless a useful construct for thinking about vision. I get excited about unifying principles, and I wish there were more papers like this in vision, especially for high-level vision.

Wednesday, September 28, 2011

Kant's Intuitions, the intentional stance, and reverse-engineering the mind

Immanuel Kant (1724-1804), Daniel Dennett (1942-), Josh Tenenbaum (~1971-)

“Thoughts without content are empty, intuitions without concepts are blind. The understanding can intuit nothing, the senses can think nothing. Only through their unison can knowledge arise.” -- Immanuel Kant

“We live in a world that is subjectively open. And we are designed by evolution to be "informavores", epistemically hungry seekers of information, in an endless quest to improve our purchase on the world, the better to make decisions about our subjectively open future.” -- Daniel Dennett

"For scientists studying how humans cometo understand their world, the central challenge is this: How do our minds get so much from so little? We build rich causal models,make strong generalizations, and construct powerful abstractions, whereas the input data are sparse, noisy, and ambiguous—in every way far too limited. A massive mismatch looms between the information coming in through our senses and the outputs of cognition." -- Josh Tenenbaum

Organizing by space (space, time, and physics)

There are two faculties of understanding which we are unlikely to have acquired from experience. The first is that of understanding objects as extended bodies in a 3D space, each occupying some volume. I believe it was Kant who argued best against the hardcore British Empiricists, who proclaimed that experience is the sole originator of knowledge. Experiences are the pen strokes which fill the Empiricist’s tabula rasa. Kant argued (against Hume) that the concept of a spatially extended object is not acquired from experience: the very notion of experience requires that we already possess the notion of an object in order to have a meaningful percept. It is as if the Empiricists failed to acknowledge that to make strokes on a sheet of paper, we first need a pen. Kant’s intuitions are the pens of experience. The requirement of having suitable intuitions for grouping percepts into experiences is what Kant described as a form of transcendental idealism. “Objectness” is a faculty of human understanding, not something acquired from experience. If you are a vision researcher, being aware of this can have drastic implications for your research programme.

It has also been argued that some primitive notions of object dynamics, aka folk-physics, are possessed by very young children. Given the uniformity of human experience (at least I have no ostensible reason to doubt that my colleagues’ experiences are much like my own) and the diversity of our individual upbringings, it is also unlikely that folk-physics is learned from experience. However, I don't want to make any strong claims regarding folk-physics. I feel safe saying that Quantum Mechanics is another story -- it requires years of mathematics and thousands of hours of deliberate problem solving to grasp.

Organizing by mind (psychology, mind, and intent)

The second faculty of understanding, which can be found in many aspects of human intelligence, is that of understanding the world in terms of cognitive agents. Humans have an amazing capacity for attributing minds to things. This way of thinking about the world is so common and uniform among children all over the world that the uniformity of their capacity to project humanness onto objects cannot be reconciled with the differences in their upbringing. Consider the following video (thanks to J. Tenenbaum's lectures for pointing this out).

We cannot just view this video as triangles, dots, and lines. Each one of us understands the story as a narrative about agents and their intent. We are stimulated by the external world, we take sense-data as input, and the brain helps us make sense of it -- it turns the hodgepodge of data into experience. But the brain is a mold: it conforms percepts to some shape defined by the mold. These molds are the faculties of understanding which let us understand things; it is as if the faculties of understanding were basis vectors onto which we project all input sense-data. The data is weak and noisy, the priors are strong, and understanding is the result of their union. An experience without a proper basis is blind; it is just a ball of percepts. These faculties allow us to have experience. The experiences, coupled with memory, allow us to obtain understanding, where understanding is the relationship between a given experience and past experiences: either direct associations between currently-experienced objects and previously-experienced objects, or rules abstracted from previously-experienced objects and applied directly to the current sense data.

What I am talking about is what philosopher Daniel Dennett calls the “intentional stance.” Given my background in AI and philosophy of mind, it is very likely that Dennett and I have had the same influences. I like to juxtapose my ideas with those of the classical philosophers such as Descartes, Locke, Kant, Wittgenstein, and Pinker -- I’m not sure how Dennett motivates his philosophy, nor do I know against whose ideas he juxtaposes his own stance.

At MIT, J. Tenenbaum is pushing these ideas to the next level. I only wish there were more perception in his work -- toy worlds just don't do it for me. I want to build intelligent machines, and I really cannot afford to sidestep the issue of perception. Josh Tenenbaum gave a great talk on reverse-engineering the mind at NIPS 2010; the video is available on videolectures.net.

Implications for Artificial Intelligence and Machine Vision

Following Josh Tenenbaum, I think that a criticism of classical machine learning is long overdue. Machine Learning, as a field, has been spewing out hardcore empiricists. “Let me download your features, my machine learning algorithm will take care of the rest,” they say. It is as if the glory were in the mathematics, which manipulates N-D vectors. But I argue that intelligence isn’t “in the calculus”; it is in what the primitives in the calculus actually represent. As an undergraduate I proclaimed, “I am not a mathematician, I am a physicist. I care about the structure of the world, not the structure of proofs.” As a graduate student I proclaimed, “The glory isn’t in the manipulation of vectors, the glory is in understanding the what/why of encoding information about the world into vectors. I am a computer vision researcher, not a machine learning researcher.” This is also why I think the view of the world as coming from K different classes is wrong: it is merely a convenient view when the statistician’s toolbox is at your disposal. It is all about structuring the input to match a researcher’s high-level intuitions about the world.

Friday, September 09, 2011

My first week at MIT: What is intelligence?

In case anybody hasn't heard the news, I am no longer a PhD student at CMU. After I handed in my camera-ready dissertation, it didn't take long for my CMU advisor to promote me from his 'current students' list to his 'former students' list on his webpage. Even though I doubt there is any place in the world that can rival CMU when it comes to computer vision, I've decided to give MIT a shot. I had wanted to come to MIT for a long time, but 6 years ago I chose CMU's RI over MIT's CSAIL for my computer vision PhD. Life is funny: the paths we take aren't dead ends -- I'm glad I had a second chance to come to MIT.

In case you haven't heard, MIT is a little tech school somewhere near Boston. Lots of undergrads can be caught wearing math T-shirts, and posters like the following can be found on the walls of MIT:

A cool (undergrad-targeted) poster I saw at MIT

As of last week I'm officially a postdoc in CSAIL, and I'll be working with Antonio Torralba and Aude Oliva. I've been closely following both Antonio's and Aude's work over the last several years, and getting to work with these giants of vision will surely be a treat. In case you don't know what a postdoc is, it is a generic term for post-PhD researchers with generally short-term (1-3 year) appointments. People generally use the title Postdoctoral Fellow or Postdoctoral Associate to describe their position in a university. I guess 3 years working on vision as an undergrad and 6 years working on vision as a grad student just wasn't enough for me...

I've been getting used to my daily commute through scenic Boston, learning about all the cool vision projects in the lab, and meeting all the PhD students working with Antonio. Today was the first day of a course I'm sitting in on, titled "What is intelligence?". When I saw a course offered by two computer vision titans (Shimon Ullman and Tomaso Poggio), I couldn't resist. Here is the course information:

What is intelligence?

"What is intelligence?" course homepage: http://web.mit.edu/9.s915/www/

Class Times:	Friday 11:00 am-2:00 pm
Units:	3-0-9
Location:	46-5193 (NOTE: we had to choose a bigger room)
Instructors:	Shimon Ullman and Tomaso Poggio

The class was packed -- we had to relocate to a bigger room. Much of today's lecture was given by Lorenzo Rosasco, the Team Leader of IIT@MIT. Here is a blurb from IIT@MIT's website describing what this 'center' is all about:

The IIT@MIT lab was founded from an agreement between the Massachusetts Institute of Technology (MIT) and the Istituto Italiano di Tecnologia (IIT). The scientific objective is to develop novel learning and perception technologies – algorithms for learning, especially in the visual perception domain, that are inspired by the neuroscience of sensory systems and are developed within the rapidly growing theory of computational learning. The ultimate goal of this research is to design artificial systems that mimic the remarkable ability of the primate brain to learn from experience and to interpret visual scenes.

Another cool class offered this semester at MIT is Antonio Torralba's Grounding Object Recognition and Scene Understanding.