Friday, December 31, 2010

why I should be hacking with a kinect

It was recently brought to my attention that Alex Berg a.k.a. Alexander Berg is hacking with a Kinect.
In case you didn't know, Alex Berg is an assistant professor at Stony Brook University as of Sept 2010.  He came out of Jitendra Malik's group, and can be thought of as my academic uncle (because he got his PhD with Jitendra at basically the same time as my advisor, Alyosha Efros). I am a big fan of Alex Berg's work.  (See the paper at ECCV 2010: What does classifying more than 10,000 image categories tell us? and note his upcoming workshop "Large Scale Learning for Vision" at CVPR 2011).

I had already known that Xiaofeng Ren has been hacking with RGB-D cameras such as the Kinect for some time now.  Xiaofeng Ren has been a research scientist at Intel Labs Seattle since 2008 and on the affiliate faculty of the CSE department at UW since 2010.  He is another one of my many academic uncles and has contributed greatly to the field of Computer Vision.  For some of his recent work with Kinects, see his RGB-D project page. Xiaofeng Ren's work has also been very influential on my own research -- it is worthwhile to recall that he coined the term "superpixels", which is prevalent in contemporary Computer Vision literature.

So when I learned that these bad-ass ex-Berkeley hackers are hacking with Kinects, I figured it was time to acquire one of my own. I bought a Kinect today and plan on playing with Alex Berg's kinect2matlab interface for Mac OS X soon!

So, why aren't you hacking with a kinect?

Monday, November 22, 2010

I, for one, welcome our new Visual Memex-based overlords

Welcome to the era of visual intelligence -- the era of Visual Memex-based overlords (now in 3D!)

The goal of today's post is simple: to empower you, the reader, with an exciting and fresh perspective on the problem of visual reasoning.  This simple idea is one of the central tenets promulgated in my upcoming doctoral dissertation -- but I'd like to give this potent meme a head start.  Visual Memex-style reasoning is not the kind of reasoning that is described in classic graduate-level textbooks on AI (e.g. first-order logic).  In case you've mentally over-fit to a graduate-level CS curriculum, you might even portray my iconoclastic views as the ramblings of a lunatic -- this is okay, I know at least Ludwig would be proud.

The Visual Memex is a mentality/perspective which, I believe, can overcome many limitations faced by modern computer vision systems.  What the Visual Memex can do for visual intelligence is akin to what the World Wide Web has done for knowledge (see Weinberger's excellent book "Everything is Miscellaneous" for the full argument).  It's akin to using Google for acquiring knowledge instead of going to the library -- maybe knowledge was never meant to be embedded in bookshelves.  The idea is embarrassingly simple: replace visual object categories with object exemplars and relationships between those exemplars.  Maybe the linguistic categories that we (as humans) cannot seem to live without are mere shadows cast on the wall of a dark cave.  Psychologists have long abandoned rigid categories in their models of how humans think about concepts, but the notion of a class is so fundamental to contemporary Machine Learning that many haven't even bothered to question its tenuous foundations.  While categories (also referred to as classes) definitely make learning algorithms easier to formalize, maybe it's better to let the data speak for itself.  Free the data!

One upcoming research paper inspired by this category-free mentality is: Context-Based Search for 3D Models, by Matthew Fisher and Pat Hanrahan, of Stanford University.  This paper will be presented at SIGGRAPH Asia 2010.  Maybe it is time to abandon those rigid categories and memexify your own research problem?


Saturday, November 13, 2010

CVPR, the A+'s of yesteryear, and robots need us

It is November yet again, and I'm proud to announce my last CVPR submission as a graduate student!  It is that time of the year again -- the post-CVPR downtime.  It is time to mentally tuck away the fruits of our labor (NOTE: you might want to create a readme.txt which explains how to use the 20,000 lines of code you wrote in the 7 days preceding the deadline), consider the long-term impact of our work, and perhaps even reconsider our position in life.

I want to build intelligent machines, and I feel vision is the right place to start -- even roboticists such as Rodney Brooks started out in vision. However, I don't feel churning out 'cute' CVPR papers is going to do much.  Perhaps if all one cares about in life is getting tenure at a top ranked university, then proof-of-concept papers might be the path of least resistance.  But remember when you were a teen, and you wanted to build a rocket which lets you travel at relativistic speeds -- allowing you to go back in time?  Or remember when you wanted to build those humanoid robots that would both entertain your kid sister and help out your mother with house chores?

So why did so many intelligent people I know abandon those grand dreams and settle for bread crumbs?  Getting your paper into a peer-reviewed conference, so that you can pad your CV with another publication, is incommensurable with the dreams you once had.  The publication of today is the A+ of yesteryear, and it is just way too easy for us intellectuals to stay comfortable with those A's, without asking for more.  But robots need us; CVPR papers won't assemble themselves into intelligent machines.

But the deadline is over, and now it's time to relax.  If my rant did not make sense to you, then I envy you.  I have to move on to more positive things -- I need to finish reading Pinker's Blank Slate, read some more Wittgenstein (and fully assimilate his criticism of Augustine's theory of language-acquisition), waste two days playing with the Riemann Zeta function (because the Basel problem was only the beginning), play some guitar, etc.

Wednesday, August 25, 2010

Multifaceted Knowledge Representation: Ideas from Marvin Minsky

"I think a key to AI is the need for several representations of the knowledge, such that when the system is stuck (using one representation) it can jump to use another. When David Marr at MIT moved into computer vision, he generated a lot of excitement, but he hit up against the problem of knowledge representation; he had no good representations for knowledge in his vision systems." -- Marvin Minsky

Check out the full interview with Marvin Minsky here -- a must read for anybody serious about building intelligent machines!  This interview appears to be a part of a larger volume: Hal's Legacy.

I believe that in order to make the enterprise of computer vision a success, we must seriously broaden our outlook on the problem.  Are we seriously expecting algorithms to delineate object boundaries from real images based on statistics of patch descriptors without any sort of model of the world?

I don't know about you, but I seriously want to build intelligent machines.  I don't think there will ever be any sort of low-level SIFT-esque algorithm that "solves vision."  It is a much grander picture of intelligence that I'm really after -- and successful computer vision will be a result (component?) of a higher-level intelligent machine.  Machines need to know about a whole lot more than is found in a single image -- and the necessary conceptual tools might not be present in the computer vision community.

A recurring theme in my blog is my belief that we must become renaissance men -- a union of *nix hackers, vision scientists, cognitive scientists, philosophers, athletes, machine learning scientists, skilled orators, and much more -- if we are to have any hope of chiseling away at the problem of computational intelligence.  Minsky was a pioneer of computational intelligence, and his words revitalize my own research efforts in this direction.

Monday, August 23, 2010

Beyond pixel-wise labeling: Blocks World Revisited

"Thoughts without content are empty, intuitions without concepts are blind." -- Immanuel Kant 

The Holy Grail problem of computer vision research is general-purpose image understanding.  Given as input a digital image (perhaps from Flickr or from Google Image search), we want to recognize the depicted objects (cars, dogs, sheep, Macbook Pros), their functional properties (which of the depicted objects are suitable for sitting), and recover the underlying geometry and spatial relations (which objects are lying on the desk). 

The early days of vision were dominated by the "Image Understanding as Inverse Optics" mentality.  In order to make the problem easier, as well as to cope with the meager computational resources of the 60s, early computer vision researchers tried to recover the 3D geometry of simple scenes consisting of arrangements of blocks.  One of the earlier efforts in this direction is the PhD thesis Machine Perception of Three-Dimensional Solids by Larry Roberts at MIT back in 1963.

But wait -- these block-worlds are unlike anything found in the real world!  The drastic divide between the imagery that vision researchers were studying in the 60s and what humans observe during their daily experiences ultimately led to the disappearance of block-worlds in computer vision research.

Image Parsing Concept Image from Computer Blindness Blog

Over the past couple of decades, we have seen the success of Machine Learning, and it is of no surprise that we are currently living in the "Image Understanding as statistical inference" era.  While a single 256x256 grayscale image might have been okay to use in the 1960s, today's computer vision researchers use powerful computer clusters and do serious heavy-lifting on millions of real-world megapixel images.  The man-made blocks-world of the 1960s is a thing of the past, and the variety found on random images downloaded from Flickr is the complexity we must now cope with.

While the style of computer vision research has shifted since its early days in the 1960s/1970s, many old ideas (perhaps prematurely considered outdated) are making a comeback!

Assigning basic-level object category labels to pixels is a very popular theme in vision.  Unfortunately, to gain a deeper understanding of an image, robots will inevitably have to go beyond pixel-level class labels.  (This is one of the central themes in my thesis -- coming out soon!)  Given human-level understanding of a scene, it is trivial to represent it as a pixel-wise labeling map, but given a pixel-wise labeling map it is not trivial to convert it to human-level understanding. 

What sort of questions can be answered about a scene when the output of an "image understanding" system is represented as a pixel-wise label map?

1. Is there a car in the image?
2. Is there a person at this location in the image?

What questions cannot be answered given a pixel-wise label map?

1. How many cars are in this image? (While there are some approaches that strive to deal with delineating object instance boundaries, most image parsing approaches fail to recognize boundaries between two instances of the same category)
2. Which surfaces can I sit on?
3. Where can I park my car?
4. How geometrically stable are the objects in the scene?
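To make the counting failure concrete, here is a toy Python sketch (the label map and class ids are made up for illustration): two cars parked side by side collapse into a single blob of "car" pixels, so connected components on the label map report one car, not two.

```python
import numpy as np
from collections import deque

def count_instances(label_map, cls):
    """Count 4-connected components of one class in a pixel-wise label map."""
    mask = (label_map == cls)
    seen = np.zeros_like(mask, dtype=bool)
    H, W = mask.shape
    n = 0
    for i in range(H):
        for j in range(W):
            if mask[i, j] and not seen[i, j]:
                n += 1                      # found a new blob; flood-fill it
                q = deque([(i, j)])
                seen[i, j] = True
                while q:
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
    return n

CAR = 1
label_map = np.zeros((6, 10), dtype=int)
label_map[1:5, 1:5] = CAR   # first car
label_map[1:5, 5:9] = CAR   # second car, touching the first
print(count_instances(label_map, CAR))  # prints 1 -- the two cars merge
```

The label map is a perfectly valid "image understanding" output, yet the instance count is unrecoverable from it once same-category objects touch.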

While I have more criticisms than tentative solutions, I believe that vision students shouldn't be parochially preoccupied with only the most recent approaches to image understanding.  It is valuable to go back several decades in the literature and gain a broader perspective on image understanding.  However, some progress is being made!  A deeply insightful upcoming paper from ECCV 2010 is the following:

Abhinav Gupta, Alexei A. Efros and Martial Hebert, Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics, European Conference on Computer Vision, 2010. (PDF)

What Abhinav Gupta does very elegantly in this paper is connect the blocks-world research of the 1960s with the geometric-class estimation problem, as introduced by Derek Hoiem.  While the final system is evaluated on a Hoiem-like pixel-wise labeling task, the actual scene representation is 3D.  The blocks in this approach are more abstract than the Lego-like volumes of the 1960s -- Abhinav's blocks are actually cars, buildings, and trees. I included the famous Immanuel Kant quote because I feel it describes Abhinav's work very well.  Abhinav introduces the block as a theoretical construct which glues together a scene's elements and provides a much more solid interpretation -- Abhinav's blocks add the content to geometric image understanding which is lacking in the purely pixel-wise approaches.

While integrating large-scale categorization into this type of geometric reasoning is still an open problem, Abhinav provides us visionaries with a glimpse of what image understanding should be.  The integration of robotics with image understanding technology will surely drive pixel-based "dumb" image understanding approaches to extinction.

Thursday, June 17, 2010

more papers to check out from cvpr

Here are more CVPR 2010 papers which I either found interesting or plan on reading when I get back to PIT.  Enjoy!

Connecting Modalities: Semi-supervised Segmentation and Annotation of Images Using Unaligned Text Corpora  
Authors:  Richard Socher (Stanford University) , Li Fei-Fei (Stanford University) 

Cascade Object Detection with Deformable Part Models  
Authors:  Pedro Felzenszwalb (University of Chicago) , Ross Girshick (University of Chicago) , David McAllester (Toyota Technological Institute, Chicago) 

Beyond Active Noun Tagging: Modeling Contextual Interactions for Multi-Class Active Learning  
Authors:  Behjat Siddiquie (UMIACS) , Abhinav Gupta (Carnegie Mellon University)  

Tiered Scene Labeling with Dynamic Programming  
Authors:  Pedro Felzenszwalb (University of Chicago) , Olga Veksler (University of Western Ontario) 

Layered Object Detection for Multi-Class Segmentation  
Authors:  Yi Yang (UC Irvine) , Sam Hallman (UC Irvine) , Deva Ramanan (UC Irvine) , Charless Fowlkes (UC Irvine) 

Efficiently Selecting Regions for Scene Understanding  
Authors:  M. Pawan Kumar (Stanford University) , Daphne Koller (Stanford)   

Image Webs: Computing and Exploiting Connectivity in Image Collections
Authors:  Kyle Heath (Stanford) , Natasha Gelfand (Nokia Research - Palo Alto, CA) , Maks Ovsjanikov (Stanford University) , Mridul Aanjaneya (Stanford University) , Leonidas Guibas (Stanford University)

Sunday, June 13, 2010

constrained parametric min-cuts: exciting segmentation for the sake of recognition

I would like to introduce two papers about Constrained Parametric Min-Cuts from C. Sminchisescu's group.  These papers are very relevant to my research direction (which lies at the intersection of segmentation and recognition).  Like my own work, these papers are about segmentation for recognition's sake.  The segmentation algorithm proposed in the papers is a sort of "segment sliding" approach, where many binary graph-cut optimization problems are solved for different Grab-Cut style initializations.  These segments are then scored using a learned scoring function -- think regression versus classification.  They show that the top segments are actually quite meaningful and correspond to object boundaries really well.  Finally, a tractable number of top hypotheses (still overlapping at this stage) is piped into a recognition engine.
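Here is a toy Python sketch of just the rank-and-filter step (my own illustration, not the authors' CPMC code; the scoring function is a made-up stand-in for their learned regressor): score every figure-ground hypothesis, then greedily keep the best-scoring masks that don't overlap too much with the ones already kept.

```python
import numpy as np

def overlap(a, b):
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def rank_segments(masks, score_fn, max_keep=3, max_iou=0.5):
    """Score every hypothesis, then greedily keep top-scoring,
    mutually non-overlapping ones (a non-max suppression on segments)."""
    order = sorted(range(len(masks)), key=lambda i: score_fn(masks[i]), reverse=True)
    kept = []
    for i in order:
        if all(overlap(masks[i], masks[j]) < max_iou for j in kept):
            kept.append(i)
        if len(kept) == max_keep:
            break
    return kept

# Stand-in for the learned scoring function: favor masks of a "plausible" size.
def toy_score(mask):
    return -abs(mask.sum() - 12)

grid = np.zeros((6, 6), dtype=bool)
masks = []
for r in range(3):
    m = grid.copy()
    m[r * 2:r * 2 + 3, 1:5] = True   # three overlapping horizontal bands
    masks.append(m)
print(rank_segments(masks, toy_score, max_keep=2))  # prints [0, 1]
```

In the real system the score comes from regression on segment features and the surviving hypotheses are handed to the recognition engine; this sketch only shows the shape of that ranking stage.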

The idea that features derived from segments are better for recognition than features from the spatial support of a sliding rectangle resonates in all of my papers.  Regarding these CVPR2010 papers, I like their idea of learning a category-free "segmentation function", and the sort of multiple-segmentation version of this algorithm is very appealing.  If I remember correctly, the idea of learning a segmentation function comes to us from X. Ren, and the idea of using multiple segmentations comes from D. Hoiem. These papers combine both insights in a cool new way.

J. Carreira and C. Sminchisescu. Constrained Parametric Min-Cuts for Automatic Object Segmentation. In CVPR 2010.

F. Li, J. Carreira, and C. Sminchisescu. Object Recognition as Ranking Holistic Figure-Ground Hypotheses. In CVPR 2010.


Spotlights for these papers are during these tracks at CVPR2010:
Object Recognition III: Similar Shapes
Segmentation and Grouping II: Semantic Segmentation

Sven Dickinson at POCV 2010, ACVHL Tomorrow

This morning, Sven Dickinson gave a talk to start the POCV 2010 Workshop at CVPR2010.  For those of you who might not know, POCV stands for Perceptual Organization in Computer Vision.  While segmentation can be thought of as a perceptual grouping process, contiguous regions don't have to be the end product of a meaningful perceptual grouping process.  There are many popular and useful algorithms which group non-accidental contours yet stop short of a full-blown image segmentation.

The title of Dickinson's talk was "The Role of Intermediate Shape Priors in Perceptual Grouping and Image Abstraction." In the beginning of his talk, Sven pointed out how perceptual organization was in its prime in the mid 90s and declined in the 2000s due to the popularity of machine learning and the "detection" task.  He believes that good perceptual grouping is what is going to make vision scale -- that is, without first squeezing all that we can out of the bottom level, we are doomed to fail.

Dickinson showed some nice results from his most recent research efforts, where objects are broken down into generic "parts" -- this reminded me of Biederman's geons, although Sven's fitting is done in the 2D image plane.  Sven emphasized that successful shape primitives must be category-independent if we are to have scalable recognition of thousands of visual concepts in images.  This is much different from the mainstream per-category object detection task which has been popularized by contests such as the PASCAL VOC.

While I personally believe that there is a good place for perceptual organization in vision, I wouldn't view it as the Holy Grail.  It is perhaps the Holy Bridge we must inevitably cross on the way to finding the Holy Grail.  I believe that for full-grown fully-functional members of society, our ability to effortlessly cope with the world is chiefly due to its simplicity and repeatability, and not due to some amazing internal perceptual organization algorithm.  Perhaps it is when we were children -- viewing the world through a psychedelic fog of innocence -- that perceptual grouping helped us cut up the world into meaningful entities.

A common theme in Sven's talk was the idea of Learning to Group in a category-independent way.  This means that all of the successes of Machine Learning aren't thrown out the door, and this appears to be a quite different way of grouping from what was done in the 1970s.

Tomorrow I will be at ACVHL Workshop "Advancing Computer Vision with Humans in the Loop".  I haven't personally "turked" yet, but I feel I will be jumping on the bandwagon soon.  Anyways, the keynote speakers should make for an awesome workshop.  They do not need introductions: David Forsyth, Aude Oliva, Fei-Fei Li, Antonio Torralba, and Serge Belongie -- all influential visionaries.

everything is misc -- torralba cvpr paper to check out

Weinberger's Everything is Miscellaneous is a delightful read -- I just finished it today while flying from PIT to SFO.  It was recommended to me by my PhD advisor, Alyosha, and now I can see why!  Many of the key motivations behind my current research on object representation deeply resonate in Weinberger's book.

Weinberger motivates Rosch's theory of categorization (the Prototype Model), and explains how it is a significant break from thousands of years of Aristotelian thought.  Aristotle gave us the notion of a category -- centered around the notion of a definition.  For Aristotle, every object can be stripped to its essential core and placed in its proper place in a God's-eye objective organization of the world.  It was Rosch who showed us that categories are much fuzzier and more hectic than suggested by the rigid Aristotelian system. Just like Copernicus single-handedly stopped the Sun and set the Earth in motion, Rosch disintegrated our neatly organized world-view and demonstrated how an individual's path through life shapes his or her concepts.

I think it is fair to say that my own ideas, as well as Weinberger's, are not so much an extension of the Roschian mode of thought as a significant break from the entire category-based way of thinking.  Given that Rosch studied Wittgenstein as a student, I'm surprised her stance wasn't more extreme, more along the anti-category line of thought.  I don't want to undermine her contribution to psychology and computer science in any way, and I want to be clear that she should only be lauded for her remarkable research.  Perhaps Wittgenstein was as extreme and iconoclastic as I like my philosophers to be, but Rosch provided us with a computational theory and not just a philosophical lecture.

From my limited expertise in theories of categorization in the field of Psychology, both the Prototype Models and the more recent data-driven Exemplar Models are still theories of categories.  Whether the similarity computations are between prototypes and stimuli, or between exemplars and stimuli, the output of a categorization model is still a category.  Weinberger is all about modern data-driven notions of knowledge organization, in a way that breaks free from the imprisoning notion of a category.  Knowledge is power, so why imprison it in rigid modules called categories?  Below is a toy visualization of a web of concepts, as imagined by me.  This is very much the web-based view of the world.  Wikipedia is a bunch of pages and links.

Artistic rendition of a "web of concepts"

I found it valuable to think of the Visual Memex, the model I'm developing in my thesis research, as an anti-categorization model of knowledge -- a vast network of object-object relationships.  The idea of using little concrete bits of information to create a rich non-parametric web is the recurring theme in Weinberger's book.  In my case, the problem of extracting primitives from images, and all of the problems of dealing with real-world images, are around to plague me, and the Visual Memex must rely on many Computer Vision techniques -- such things are not discussed in Weinberger's book.  The "perception" or "segmentation" component of the Visual Memex is not trivial, whereas linking words on the web is much easier.
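For concreteness, here is a minimal Python sketch of what I mean by a category-free web of exemplars (the exemplar ids, feature vectors, and relationship names are invented for illustration): just nodes and typed object-object edges, with no class labels anywhere.

```python
# A tiny exemplar graph: every node is a concrete object instance,
# and knowledge lives entirely in the typed edges between instances.
memex = {}

def add_exemplar(eid, features):
    """Register one object exemplar with its (appearance) feature vector."""
    memex[eid] = {"features": features, "edges": []}

def add_edge(a, b, relation):
    """Record a typed, symmetric object-object relationship."""
    memex[a]["edges"].append((b, relation))
    memex[b]["edges"].append((a, relation))

def neighbors(eid, relation=None):
    """All exemplars related to eid, optionally filtered by relation type."""
    return [b for (b, r) in memex[eid]["edges"] if relation is None or r == relation]

add_exemplar("car_042", [0.1, 0.9])
add_exemplar("road_007", [0.8, 0.2])
add_exemplar("car_113", [0.2, 0.8])
add_edge("car_042", "road_007", "on-top-of")
add_edge("car_042", "car_113", "looks-like")
print(neighbors("car_042"))                 # all related exemplars
print(neighbors("car_042", "looks-like"))   # only the visually similar ones
```

Note there is no "car" category anywhere above: "car_042" and "car_113" are tied together only by a similarity edge, which is exactly the break from category-based organization Weinberger argues for.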

CVPR paper to look out for

However, the category-based view is all around us.  I expect most of this year's CVPR papers to fit in this category-based view of the world. One paper, co-authored by the great Torralba, looks relevant to my interests.  It is yet another triumph for the category-based mentality in computer vision.  In fact, one of the figures in the paper demonstrates the category-based view of the world very well.  Unlike the memex, the organization is explicit in the following figure:

Myung Jin Choi, Joseph Lim, Antonio Torralba, and Alan S. Willsky. CVPR 2010.

Friday, June 11, 2010

blogging from CVPR2010

It might not be one of those glamorous Apple events during which Steve Jobs introduces a new shiny gadget for the masses to desire, but plenty of exciting stuff happens at CVPR which instills desire into our souls, that is, the souls of computer vision scientists. Wouldn't you rather see the Great Torralba give a talk than some big company's chief executive officer? For those of you who do not know what CVPR is -- it is one of the big Computer Vision conferences during which we (the geeks, scientists, engineers, developers, hackers, and mathematicians) exchange ideas regarding our most recent research in the world of computer vision.

I am flying to SF tomorrow morning, and will be blogging about some of the cool papers I encounter at this year's CVPR. I do not have a paper at this year's conference so I'm in full assimilate-knowledge mode where I hope to absorb thousands of ideas related to my field. I already mentioned some of Kristen Grauman's cool segmentation papers, but expect to see in the next several blog posts many additional discussions for what I think are "exciting" papers. I am already getting excited and have plenty of papers to read during my flight, in addition to finishing Everything is Miscellaneous. I will be blogging from CVPR, like an Apple fanboy would at one of those Apple WWDC events -- but I will share math, theory, algorithms, and the like.

As always, the list of CVPR 2010 papers on the web can be found here.

Tuesday, May 25, 2010

my average face && second half marathon

In case you didn't know, Picasa now performs face recognition in your photos.  I found it amusing to see the progression of my own face over the past several years.  Picasa lets you extract these face tiles into an 'export' directory, and it is trivial to load them up in Matlab for additional fun.  I produced some TorralbaArt by averaging over 400 faces of myself (with no alignment whatsoever) collected over the past several years.  These photos come from my personal photo collection, so I'm not making them publicly available.  But here's the average face!  I resized all images to 500x500 before averaging and resized the average to the mean aspect ratio of all images.  The "black-eyes" come from the fact that I was wearing black sunglasses in about 10% of the photos.
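The averaging itself is trivial once the tiles are loaded. Here is a rough Python/numpy sketch of the procedure (synthetic "face tiles" stand in for the Picasa exports, and a crude nearest-neighbor resize stands in for Matlab's imresize):

```python
import numpy as np

def resize_nn(img, size=500):
    """Nearest-neighbor resize of an HxWx3 image to size x size."""
    H, W = img.shape[:2]
    ys = np.arange(size) * H // size
    xs = np.arange(size) * W // size
    return img[ys][:, xs]

def average_face(images, size=500):
    """Resize every face tile to a common square, then average pixel-wise."""
    acc = np.zeros((size, size, 3), dtype=np.float64)
    for img in images:
        acc += resize_nn(img.astype(np.float64), size)
    return (acc / len(images)).astype(np.uint8)

# Fake face tiles of varying sizes stand in for the real photo collection.
rng = np.random.default_rng(0)
faces = [rng.integers(0, 256, (h, w, 3), dtype=np.uint8)
         for h, w in [(300, 260), (420, 400), (512, 480)]]
avg = average_face(faces)
print(avg.shape)  # (500, 500, 3)
```

With no alignment, the average naturally comes out blurry; the dark-sunglasses frames dominating the eye region is exactly the kind of artifact pixel-wise averaging produces.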

On another note, I ran the Pittsburgh Half Marathon this year.  This was my second half marathon ever -- my first one was last summer in San Francisco.  This time, my finishing time was 1 hour 40 minutes, which happens to be the goal I set for myself (10 minutes faster than my SF time).  For the first 20 minutes I was passing everybody in front of me, since it was quite crowded.  I could probably shave another 2-3 minutes off if I start towards the front of the herd, but I'll need some serious training if I'm going to reach 1:30 in a future race.  I might even run a full marathon one of these days...

Sunday, May 09, 2010

graph visualizations as sexy as fractals

I love to display mathematical phenomena -- often for me the proof is in the visualization. If you ever steal one of my personal research notebooks you'll see that the number of graphs I've been drawing over the years has been increasing at a steady rate. This is a habit I acquired from studying Probabilistic Graphical Models and the machine learning-heavy curriculum at CMU.

Back in high school I was amazed by the beauty of fractals based on Newton's method for finding roots, but as I've slowly been shifting my mode of thought from continuous optimization problems to discrete ones, automated graph visualization is as close as I've ever gotten to being an artist. Here is one such sexy graph visualization from Yifan Hu's gallery.

Andrianov/lpl1 via sfdp by Yifan Hu

I have been using Graphviz for about 8 years now, and I just can't get enough. I never thought it would produce anything as beautiful as this! I generally used graphviz to produce graphs like this:

Inspired by Yifan Hu and his amazing multilevel force-directed algorithm for visualizing graphs, I've started using sfdp for some of my own visualizations. sfdp is now inside graphviz, and can be used with the -K switch as follows (also with overlap=scale):

$ dot -Ksfdp -Tpdf memex.gv > memex.pdf

Inspired by Yifan Hu's coloring scheme based on edge length, I color the edges using a standard matlab jet colormap with shorter edges being red and longer ones being blue. To get the resulting lengths of edges, I actually run sfdp twice -- once to read off the vertex positions (this is what the graph drawing optimization produces), and once again to assign the edge colors based on those lengths. I could process the resulting postscript with one run like Yifan, but I don't want to figure out how to parse postscript files today. Here is an example using some of my own data.
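The second pass can be sketched in Python roughly as follows (this is my own code, not sfdp internals; the jet approximation is crude, the node names are made up, and in practice the positions would be read off a first `sfdp -Tplain` run):

```python
import math

def jet(t):
    """Rough jet-style colormap: t=0 -> red (short edges), t=1 -> blue (long)."""
    r = max(0.0, 1.0 - 2.0 * t)
    b = max(0.0, 2.0 * t - 1.0)
    g = 1.0 - r - b
    return "#%02x%02x%02x" % (int(255 * r), int(255 * g), int(255 * b))

def color_edges(pos, edges):
    """Assign each edge a color attribute from its drawn length.
    `pos` maps node -> (x, y), as produced by the first layout pass."""
    lengths = {e: math.dist(pos[e[0]], pos[e[1]]) for e in edges}
    lo, hi = min(lengths.values()), max(lengths.values())
    span = (hi - lo) or 1.0
    return {e: jet((l - lo) / span) for e, l in lengths.items()}

# Toy positions standing in for the vertex coordinates sfdp computed.
pos = {"a": (0, 0), "b": (1, 0), "c": (5, 0)}
colors = color_edges(pos, [("a", "b"), ("a", "c")])
print(colors[("a", "b")], colors[("a", "c")])  # short edge red, long edge blue
```

The returned colors would then be written back into the .gv file as per-edge color attributes before the second sfdp render.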

Car Concept Visual Memex via sfdp by Tomasz Malisiewicz

This is a visualization of the car subset of the Visual Memex I use as an internal organization of visual concepts to be used for image understanding. If you click on this image, it will show you a significantly larger png.

As a sanity check, I also created a visualization of a standard UF Sparse Matrix (here are both mine and Yifan's results)
UTM1700b via sfdp by Yifan Hu

UTM1700b via sfdp by Tomasz Malisiewicz

As you can see, the graphs are pretty similar, modulo some coloring strategy differences -- but since the colors are somewhat arbitrary, this is not an issue. If you click on these pictures you can see the PDFs which were generated via graphviz. Now if only my real-world computer vision graph were as structured as these toy problems -- then others could view me as both an artist and a scientist (like a true Renaissance man).

Wednesday, April 14, 2010

Internet-scale vision at CVPR 2010

I've been reading some of the recent CVPR 2010 papers (check out the CVPR papers on the web page to see the full list), and I came across a cool video produced by Yasutaka Furukawa. I met Yasutaka when I was a visitor at Jean Ponce's WILLOW group in Paris during Spring 2008, and I was truly amazed by some of the cool geometry-based work he has done. Being a recognition/machine-learning guy myself, I can only appreciate and wonder at the amazing work produced by in-depth knowledge of geometry. In this particular case, the images aren't ones that Yasutaka collected himself. The idea behind internet-scale vision is that you can use the millions of photos on sites such as Flickr.

Here is a cool video below, very much in the spirit of Photosynth.

It is also not a surprise to find that Yasutaka is now working at Google. One can only imagine where Google is going to apply the "Street-View" mentality next. Cities like NYC already have nice high-resolution building facades, see picture below from Google Earth Blog.

I want to one day run all of my object recognition experiments on Google Street View, and there is probably only a handful of places in the world that have the computational infrastructure to play with such experiments. I drool at the idea of one day building a Visual Memex from billions of online images (and this can only happen at a place like Google).

Monday, April 05, 2010

Ontology is Overrated: Categories, Links, and Tags

This is the title of a powerful treatise written by Clay Shirky, in which he strives to "convince you that a lot of what we think we know about categorization is wrong."  Much thanks to David Weinberger's blog for pointing out this article.  The take home message is quite similar to some of the "Beyond Categories" ideas I've tried to promulgate in my meager attempt to understand why progress in computer vision has reached a standstill.  For anybody interested in understanding the limitations of classical systems of categorization, this article is worth a read.

Exciting Computer Vision papers from Kristen Grauman's UT-Austin Group

Back in 2005, I remember meeting Kristen Grauman at MIT's accepted PhD student open house.  Back then she was a PhD student under Trevor Darrell (and is known for her work on the Pyramid Match Kernel), but now she has her own vision group at UT-Austin.  She is the advisor behind many cool vision projects there, and here are a few segmentation/categorization related papers from the upcoming CVPR2010 conference.  I look forward to checking out these papers because they are relevant to my own research interests.  NOTE: some of the paper links are still not up -- I just used the links from Kristen's webpage.

Monday, March 22, 2010

PhDs make many smart programmers become software engineering n00bs

This is true. A couple of years in a PhD program -- reading papers and writing throw-away code in Matlab -- and it is easy to become a throw-away programmer, a sort of liability in the real world. It is no surprise that many companies look down on hiring PhDs. I've seen kids enter the PhD program with real programming talent and exit as real software engineering n00bs. In graduate school, you might code for 6 years without anybody grading your code. If you get sloppy, you will be worse off than when you started.

The problem is that many advisors don't care whether their students write good code. Writing good papers and giving good presentations -- you will be told that this is what makes a good PhD student. Who cares about writing good code? -- we'll just have some 'engineering' people re-write it once you become famous. This is what students across the globe are being fed. It is no surprise, because your advisor won't get tenure by turning you into a mean, mathematically-inclined super hacker. Then again, your advisor also won't care if you go bald, are malnourished, and have no life outside research. There are many things one has to take care of oneself, and software development skills are no different.

Note to the real world looking to hire talent: You should grill, I mean really grill, fresh PhDs regarding their software development skills. Don't become mesmerized by their 4.0s, their long publication lists, and all their 'achievements.' If you want to hire a fresh PhD to write code, whether in a research or an engineering setting, then give them one hell-of-an-interview. I agree with Google's interview process. I studied for it, I am proud of my own software engineering skills, and I was proud to have been an intern at Google (twice). But I know of companies who were sorry they hired PhDs, only to learn these recent graduates could only dabble at the whiteboard and would utterly fail at the terminal.

Note to PhDs looking to one day take your skill-set and impact the real world: Never stop learning and never stop writing good code. Never stop taking care of yourself. You were the brightest of the brightest before you started your PhD, and now you have 5-6 years to exit as a real superman. With all the mathematics and presentation skills you will acquire during a PhD, on top of good software engineering skills, you will become invaluable to the real world. It's a real shame to become less valuable to the outside world after 6 years of a strenuous PhD program. But nobody will give you the recipe for success. Nobody will tell you to exercise, but if you want to pound your brain with mental challenges for decades to come, you will need physical exercise in your daily regimen. Your advisors won't tell you that keeping up to date on the tools of the trade, and being a real hacker, is very valuable in the real world. You will be told that fast results = many papers and that it's not worth writing good code.

After obtaining a PhD we should be role-models for the entire world. Seriously, why not? If a PhD is the highest degree that an institution can grant, then we should feel proud about getting one. But we are human, and one is only as strong as their weakest link. We should become super hackers, fear no quantum mechanics, fear no presentation in front of a crowd, and be all that one can be.

This is part of a series of posts aimed at finding flaws in the academic/PhD process and how it pertains to building strong/intelligent/confident individuals.

Thursday, March 18, 2010

Back to basics: Vision Science Transcends Mathematics

Vision (a.k.a. image understanding, image interpretation, perception, object recognition) is quite unlike some of the mathematical problems we were introduced to in our youth. In fact, thinking of vision as a "mathematical problem" in the traditional sense is questionable. An important characteristic of such "problems" is that by pointing them out we already have a notion of what it would be like to solve them. Does a child think of gravity as such a problem? Well, probably not, because without the necessary mathematical backbone there is no problem with gravity! It's just the way the world works! But once a child has been perverted by mathematics and introduced into the intellectual world of science, the world ceases to just be. The world becomes a massive equation.

Consider the seemingly elementary problem of finding the roots of a cubic polynomial. Many of us can recite the quadratic formula by heart, but not the one for cubics (try deriving the simpler quadratic formula by hand). If we were given one evening and a whole lot of blank paper, we could try tackling this problem (no Google allowed!). While the probability of failure is quite high (and arguably most of us would fail), it would still make sense to speak of "coming closer to the solution". Maybe we could even solve the problem when some terms are missing, etc. The important thing here is that the notion of having reached a solution is well-defined. Also, once we've found the solution, it would probably be easy to convince ourselves that it is correct (verification would be easier than actually coming up with the solution).
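The asymmetry between finding a solution and verifying one is easy to make concrete in code. A minimal sketch (my own illustration, using NumPy's companion-matrix root finder -- the tooling is not part of the argument):

```python
import numpy as np

# x^3 - 6x^2 + 11x - 6 factors as (x - 1)(x - 2)(x - 3).
coeffs = [1, -6, 11, -6]

# Finding the roots is the hard direction (np.roots solves it via the
# eigenvalues of the companion matrix).
roots = np.roots(coeffs)

# Verifying is the easy direction: plug each root back in and check
# that the polynomial vanishes.
for r in roots:
    assert abs(np.polyval(coeffs, r)) < 1e-8

print(sorted(roots.real))  # approximately [1.0, 2.0, 3.0]
```

The well-defined notion of "solved" is exactly what vision lacks: there is no np.polyval for "is this region a car?".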

Vision is more like theoretical physics, psychology, and philosophy and less like the well-defined math problem I described above. When dealing with that math problem, we know what the symbols mean and we know the valid operations -- the game is already set in place. In vision, just like physics, psychology, and philosophy, the notion of a fundamental operational unit (which happens to be an object for vision) isn't as rigidly defined as the Platonic Ideals used throughout mathematics. We know what a circle is, we know what a real-valued variable is, but what is a "car"? Consider your mental image of a car. Now remove a wheel and ask yourself: is this still a car? Surely! But what happens as we start removing more and more elements? At what point does this object cease to be a car and become a motor, a single tire, or a piece of metal? The circle, a Platonic Ideal, ceases to be a circle once it has suffered the most trivial of perturbations -- any deviation from perfection, and boom! the circle is a circle no more.

Much of Computer Vision does not ask such metaphysical questions, as objects of the real world are seamlessly mapped to abstract symbols that our mathematically-inclined PhD students love to play with. I am sad to report that this naive mapping between objects of the real world and mathematical symbols isn't so much a question of style; it is basically the foundation of modern computer vision research. So what must be done to expand this parochial field of Vision into a mature one? Wake up and stop coding! I think Vision needs a sort of mental coup d'état, a fresh outlook on an old problem. Sometimes, to make progress, we have to start with a clean slate -- current visionaries do not possess the right tools for this challenging enterprise. Instead of throwing higher-level mathematics at the problem, maybe we are barking up the wrong tree? But if mathematics is the only thing we are good at, then how are we to have a mature discussion that transcends mathematics? The window through which we peer circumscribes the world we see.

I believe that if we are to make progress in this challenging endeavor, we must first become Renaissance men, a sort of Nietzschean Übermensch. We must understand what has been said about perception, space, time, and the structure of the universe. We must become better historians. We must study not only more mathematics, but more physics, more psychology, read more Aristotle and Kant, build better robots, engineer more stable software, become better sculptors and painters, become more articulate orators, establish better personal relationships, etc. Once we've mastered more domains of reality, and only then, will we have a better set of tools for coping with the paradoxes inherent in artificial intelligence. Because a better grasp on reality -- inching closer to enlightenment -- will result in asking more meaningful questions.

I am optimistic. But the enterprise which I've outlined will require a new type of individual, one worthy of the name Renaissance Man. We aren't interested in toy problems here, nor cute solutions. If we want to make progress, we must shape our lives and outlooks around this very fact. Two steps backwards and three steps forward. Rinse, lather, repeat.

Friday, March 05, 2010

Representation and Use of Knowledge in Vision: Barrow and Tenenbaum's Conclusion

To gain a better perspective on my research regarding the Visual Memex, I spent some time reading Object Categorization: Computer and Human Vision Perspectives, which contains many lovely essays on Computer Vision. The book collects recently written essays by titans of Computer Vision and a great deal of lessons learned from history. While such a 'looking back' on vision makes for a good read, it is also worthwhile to find old works 'looking forward' and anticipating the successes and failures of the upcoming generations.

In this 'looking forward' fashion, I want to share a passage regarding image understanding systems from "Representation and Use of Knowledge in Vision" by H. G. Barrow and J. M. Tenenbaum, July 1975. This is a short paper worth reading for both graduate students and professors interested in pushing Computer Vision research to its limits. I enjoyed the succinct and motivational ending so much that it is worth repeating verbatim:


III Conclusion

We conclude by reiterating some of the major premises underlying this paper:

The more knowledge the better.
The more data, the better.
Vision is a gigantic optimization problem.
Segmentation is low-level interpretation using general knowledge.
Knowledge is incrementally acquired.
Research should pursue Truth, not Efficiency.
A further decade will determine our skill as visionaries.


Friday, February 19, 2010

Data-Driven Image Parsing With the Visual Memex: Thesis Proposal Complete!

Yesterday, I successfully gave my thesis proposal talk at CMU and it was a great experience. The feedback I obtained from my committee members was invaluable, especially the comments from Takeo Kanade. It was a great honor for me to have Takeo Kanade, one of the titans of vision, on my committee. My external member, Pietro Perona, is also a key player in object recognition, and provided some perceptive comments.

I gave my talk on my Macbook Pro using Keynote. I used the DVI output to connect to the projector, and on my screen I was able to see the current slide as well as the upcoming slide. Using Skype I was able to connect to Pietro in California and share the presentation screen (not my two-slide presenter screen!) with him. This way, he was able to follow along and see the same slides as everybody in the room. Skype was a great success!

I would like to thank everybody who came to my talk!

Tuesday, January 26, 2010

Beyond Categories =? Doing without Concepts

The term "beyond category," from my limited knowledge, was originally coined to describe the music of Duke Ellington. It is a term of praise that acknowledges that one's style is inimitable and transcends barriers.

"Beyond Categories" was the first part of my NIPS 2009 paper's title. To "go beyond" means to transcend, to abandon or do without some limitation and strive higher -- there is nothing magical about my use of the term. I used the term category to refer to object categories, as are commonly used in computer vision, artificial intelligence, machine learning, as well as psychology, philosophy, and other branches of cognitive science. One of my research goals is to go beyond the use of categories as the basis for machine perception and visual reasoning. It has been argued by Machery that the term category is roughly equivalent to the term concept as used in psychology literature. In some sense the title of Machery's recent book, "Doing without concepts," is analogous to the phrase "Beyond categories" but to reassure myself I'll have to finish reading Machery's book.

So far the first chapter has been a delightful exposition into the world of concepts, a term dear to researchers in machine perception (AI) as well as human categorization (psychology). I look forward to reading the rest of the book, which I accidentally found while looking for Estes' book on categorization. I had already digested/assimilated some of Machery's work, in particular his paper titled Concepts are not a natural kind, so seeing his name on a book at the CMU library piqued my interest. In this 2005 paper, Machery argues that the debate between prototypes vs. exemplars vs. theories in the literature on concepts is not well-founded and there is no reason to believe a single theory should prevail. I'll attempt to summarize some of his take-home messages and their relevance to computer vision once I finish this book.

Wednesday, January 20, 2010

Heterarchies and Control Structure in Image Interpretation

Several days ago I was reading one of Takeo Kanade's classic computer vision papers from 1977, titled "Model Representation and Control Structure in Image Understanding," and I came across a new term: heterarchy. I think motivating this concept is as important as defining it. At the representational level, Kanade does a good job of advocating the use of multiple levels of representation -- from pixels to patches to regions to subimages to objects. In addition to discussing the representational aspects of image understanding systems, Kanade analyzes different strategies for using knowledge in such systems (he uses the term control structure to signify the overall flow of information between subroutines). At one extreme is pass-oriented processing (this is Kanade's term -- I prefer the terms feed-forward or bottom-up), which relies on iteratively building higher levels of interpretation from lower ones. Marr's vision pipeline is mostly bottom-up, but that discussion will be left for another post. At the other extreme is top-down processing, where the image is analyzed in a global-to-local fashion. Of course, as of 2010 these ideas are being used on a regular basis in vision. One example is the paper Learning to Combine Bottom-Up and Top-Down Segmentation by Levin and Weiss.

Kanade acknowledges that the flow of a vision algorithm is very much dependent on the representation used. For image understanding, bottom-up as well as top-down processing will both be critical components of the entire system. However, the exact strategy for combining these processes, in addition to countless other mid-level stages, is not very clear. Directly quoting Kanade, "The ultimate style would be a heterarchy, in which a number of modules work together like a community of experts with no strict central executive control." According to this line of thought, processing would occur in a loopy and cooperative style. Kanade attributes the concept of a heterarchy to Patrick Winston, who worked with robots in the golden days of AI at MIT. Like Kanade, Winston criticizes a linear flow of information in scene interpretation (this criticism dates back to 1971). The basic problem outlined by both Kanade and Winston is that modules such as line-finders and region-finders (think segmentation) are simply not good enough to be used in subsequent stages of understanding. In my own research I have used the concept of multiple image segmentations to bypass some of the issues with relying on the output of low/mid-level processing for high-level processing. In 1971 Winston envisioned an algorithmic framework that is a melange of subroutines -- a web of algorithms created by different research groups -- that would interact and cooperate to understand an image. This is analogous to the development of an operating system like Linux. There is no overall theory, developed by a single research group, that made Linux a success -- it is the body of hackers and engineers producing a wide range of software that made Linux what it is.
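The multiple-segmentations trick -- trust no single low-level partition, but pool candidate regions from many -- can be sketched in a toy form. Here simple intensity thresholding stands in for a real segmentation algorithm, and all function names are my own invention, not code from any actual system:

```python
import numpy as np

def multi_threshold_segmentations(gray, thresholds):
    """Toy stand-in for running a real segmentation algorithm at
    several parameter settings: each threshold yields one candidate
    partition of the image into foreground/background regions."""
    return [(gray > t).astype(int) for t in thresholds]

def pool_regions(segmentations):
    """Pool every distinct region mask across all segmentations, so a
    high-level stage can score each candidate region instead of
    inheriting the mistakes of one low-level partition."""
    regions = []
    for seg in segmentations:
        for label in np.unique(seg):
            regions.append(seg == label)
    return regions

gray = np.array([[0.1, 0.2, 0.8],
                 [0.1, 0.9, 0.8],
                 [0.1, 0.2, 0.3]])
segs = multi_threshold_segmentations(gray, thresholds=[0.15, 0.5, 0.85])
pool = pool_regions(segs)  # candidate regions for high-level scoring
```

A high-level interpretation stage would then evaluate every pooled region, which is the loopy, cooperative spirit of Kanade's heterarchy in miniature.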

Unfortunately, given the tradition of computer vision research, I believe that an open-source-style group effort in this direction will not come out of university-style research (which is overly coupled with the publishing cycle). It would be a noble effort, but it would be more a feat of engineering than of science. Imagine a group of 2-3 people creating an operating system from scratch -- it seems like a crazy idea in 2010. However, computer vision research is often done in such small teams (in fact, there is often a single hacker behind a vision project). But maybe going open-source and allowing several decades of interaction will actually produce usable image understanding systems. I would like to one day lead such an effort -- being both the theoretical mastermind and the hacker behind this vision. I am an INTJ, hear me roar.

Monday, January 18, 2010

Understanding versus Interpretation -- a philosophical distinction

Today I want to bring up an interesting discussion regarding the connotation of the word "understanding" versus "interpretation," particularly in the context of "scene understanding" versus "scene interpretation." While many vision researchers use these terms interchangeably, I think it is worthwhile to make the distinction, albeit a philosophical one.

On Understanding
While everybody knows that the goal of computer vision is to recognize all of the objects in an image, there is plenty of disagreement about how to represent objects and recognize them in the image. There is a physicalist account (from Wikipedia: Physicalism is a philosophical position holding that everything which exists is no more extensive than its physical properties), where the goal of vision is to reconstruct veridical properties of the world. This view is consistent with the realist stance in philosophy (think back to Philosophy 101) -- there exists a single observer-independent 'ground-truth' regarding the identities of all of the objects contained in the world. The notion of vision as measurement is very strong under this physicalist account. The stuff of the world is out there just waiting to be grasped! I think the term "understanding" fits very well into this truth-driven account of computer vision.

On interpretation
The second view, a postmodern and anti-realist one, is of vision as a way of interpreting scenes. The shift is from veridical recovery of the properties of the world from an image (measurement) to the observer-dependent interpretation of the input stimulus. Under this account, there is no need to believe in a god's-eye 'objective' view of the world. Image interpretation is the registration of an input image with a vast network of past experience, both visual and abstract. The same person can vary their own interpretation of an input as time passes and their internal knowledge base evolves. Under this view, two distinct robots could provide very useful yet distinct 'image interpretations' of the same input image. The main idea is that different robots could have different interpretation-spaces; that is, they could obtain incommensurable (yet very useful!) interpretations of the same image.

It has been argued by Donald Hoffman (Interface Theory of Perception) that there is no reason why we should expect evolution to have driven humans toward veridical perception. In fact, Hoffman argues that nature drives veridical perception toward extinction, and that it only makes sense to speak of perception as guiding agents toward pragmatic interpretations of their environment.

In philosophy of science, there is the debate of whether the field of physics is unraveling some ultimate truth about the world versus physics painting a coherent and pragmatic picture of the world. I've always viewed science as an art and I embrace my anti-realist stance -- which has been shaped by Thomas Kuhn, William James, and many others. While my scientific interests have currently congealed in computer vision, it is no surprise that I'm finding conceptual agreement between my philosophy of science and my concrete research efforts in object recognition.

Tuesday, January 12, 2010

Image Interpretation Objectives

An example of a typical complex outdoor natural scene that a general knowledge-based image interpretation system might be expected to understand is shown in Figure 1. An objective of such systems is to identify semantically meaningful visual entities in a digitized and segmented image of some scene. That is, to correctly assign semantically meaningful labels (e.g., house, tree, grass, and so on) to regions in an image -- see [29,30]. A computer-based image interpretation system can be viewed as having two major components, a "low-level" component and a "high-level" component [19],[31]. In many respects, the low-level portion of the system is designed to mimic the early stages of visual image processing in human-like systems. In these early stages, it is believed that scenes are partitioned, to some extent, into regions that are homogeneous with respect to some set of perceivable features (i.e., feature vector) in the scene [6],[40],[39]. To this extent, most low-level general purpose computer vision systems are designed to perform the same task. An example of a partitioning (i.e., segmentation) of Figure 1 into homogeneous regions is shown in Figure 2. The knowledge-based computer vision system we shall describe in this paper is not currently concerned with resegmenting portions of an image. Rather, its task is to correctly label as many regions as possible in a given segmentation.

This is a direct quote from a 1984 paper on computer vision. A great example of segmentation-driven scene understanding. The content is similar enough to my own line of work that it could have been an excerpt from my own thesis.
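The objective described in the quoted passage -- assign semantic labels to the regions of a given segmentation -- reduces to a tiny sketch. The labels, prototype features, and nearest-prototype rule below are invented placeholders for illustration, not the 1984 system's actual knowledge base:

```python
import numpy as np

LABELS = ["sky", "grass", "house"]
# hypothetical mean-intensity prototype per semantic label
PROTOTYPES = {"sky": 0.9, "grass": 0.4, "house": 0.6}

def label_regions(image, segmentation):
    """Assign each region id in `segmentation` the semantic label
    whose intensity prototype is nearest to the region's mean."""
    assignments = {}
    for region_id in np.unique(segmentation):
        mean_val = image[segmentation == region_id].mean()
        assignments[region_id] = min(
            LABELS, key=lambda l: abs(PROTOTYPES[l] - mean_val))
    return assignments

image = np.array([[0.95, 0.90, 0.90],
                  [0.45, 0.40, 0.60],
                  [0.40, 0.65, 0.60]])
segmentation = np.array([[0, 0, 0],
                         [1, 1, 2],
                         [1, 2, 2]])
assignments = label_regions(image, segmentation)
```

Real evidential systems like Wesley's score regions against far richer knowledge than a single intensity prototype, but the "label as many regions as possible" objective has this shape.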

It is actually in a section called Image Interpretation Objectives from "Evidential Knowledge-Based Computer Vision" by Leonard P. Wesley, 1984. I found this while reading lots of good tech reports from SRI International's AI Center in Menlo Park. Some good stuff there by Tenenbaum, Barrow, Duda, Hart, Nilsson, Fischler, Pereira, Pentland, Fua, and Szeliski, to name a few. Lots of the material there is relevant to scene understanding and grounds the problem in robotics (since there was no "internet" vision back in the 70s and 80s).

On another note, I still haven't been able to find a copy of the classic paper, Experiments in Interpretation-Guided Segmentation by Tenenbaum and Barrow from 1978. If anybody knows where to find a pdf copy, send me an email. UPDATE: Thanks for the quick reply! I have the paper now.