
Friday, December 06, 2013

Brand Spankin' New Vision Papers from ICCV 2013

The International Conference on Computer Vision, ICCV, gathers the world's best researchers in Computer Vision and Machine Learning to showcase their newest and hottest ideas. (My work on the Exemplar-SVM debuted two years ago at ICCV 2011 in Barcelona.) This year, at ICCV 2013 in Sydney, Australia, the vision community witnessed grand new ideas and excellent presentations, and gained insights that are likely to influence the direction of vision research in the upcoming decade.


3D data is everywhere.  Detectors are not only getting faster, but getting stylish.  Edges are making a comeback.  HOGgles let you see the world through the eyes of an algorithm. Computers can automatically make your face pictures more memorable. And why ever stop learning, when you can learn all day long?

Here is a breakdown of some of the must-read ICCV 2013 papers which I'd like to share with you:


From Large Scale Image Categorization to Entry-Level Categories. Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L. Berg, ICCV 2013.

This paper is the Marr Prize winning paper from this year's conference.  It is all about entry-level categories - the labels people naturally use to name an object - which were originally defined and studied by psychologists in the 1980s. The authors study entry-level categories at a large scale and learn the first models for predicting entry-level categories for images. They learn mappings between concepts predicted by existing visual recognition systems and entry-level concepts that could be useful for improving human-focused applications such as natural language image description or retrieval. NOTE: If you haven't read Eleanor Rosch's seminal 1978 paper, Principles of Categorization, do yourself a favor: grab a tall coffee, read it, and prepare to be rocked.
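
To make the flavor of the problem concrete, here is a tiny Python sketch of my own (an illustration, not the authors' system) of one ingredient: turning a fine-grained prediction into the word a person would actually say, by walking up a hypernym chain and trading off word naturalness (a stand-in for the paper's n-gram frequency counts) against distance from the original concept. The hypernym chain and frequency numbers below are made up for illustration.

    # Hypothetical hypernym chain and log word-frequency table, for illustration.
    HYPERNYMS = {"grampus griseus": "dolphin", "dolphin": "whale",
                 "whale": "mammal", "mammal": "animal"}
    LOG_FREQ = {"grampus griseus": 1.0, "dolphin": 9.5, "whale": 10.1,
                "mammal": 8.7, "animal": 11.2}

    def entry_level(leaf, tradeoff=1.5):
        """Score each ancestor by log-frequency minus a penalty per hop up."""
        best, best_score, node, hops = leaf, float("-inf"), leaf, 0
        while node is not None:
            score = LOG_FREQ[node] - tradeoff * hops
            if score > best_score:
                best, best_score = node, score
            node, hops = HYPERNYMS.get(node), hops + 1
        return best

    print(entry_level("grampus griseus"))  # -> "dolphin" with these toy numbers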


Structured Forests for Fast Edge Detection, P. Dollar and C. L. Zitnick, ICCV 2013.

This paper from Microsoft Research is all about pushing the boundaries of edge detection. Randomized Decision Trees and Forests have been used in lots of excellent Microsoft Research papers, with Jamie Shotton's Kinect work being one of the best examples, and they are now being used for super high-speed edge detection.  However, this paper is not just about edges.  Quoting the authors, "We describe a general purpose method for learning structured random decision forest that robustly uses structured labels to select splits in the trees."  Anybody serious about learning for low-level vision should take a look.

There is also some code available, but take a close look at the license before you use it in your project.  It is not an MIT license.
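
To give a feel for the structured-labels trick as I understand it (a hedged sketch of my own, not the authors' code): structured labels such as binary edge masks have no natural information gain, so you can map each mask into a small discrete label space and then score candidate splits with ordinary multiclass entropy.

    import numpy as np

    rng = np.random.default_rng(0)
    masks = rng.integers(0, 2, size=(200, 16 * 16))   # toy 16x16 edge masks

    def discretize(masks, n_classes=4):
        """Map structured labels to {0..n_classes-1} via a random projection."""
        proj = masks.astype(float) @ rng.normal(size=masks.shape[1])
        bins = np.quantile(proj, np.linspace(0, 1, n_classes + 1)[1:-1])
        return np.digitize(proj, bins)

    def entropy(labels):
        p = np.bincount(labels) / len(labels)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    y = discretize(masks)
    print("entropy of discretized structured labels:", entropy(y))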


HOGgles: Visualizing Object Detection Features, C. Vondrick, A. Khosla, T. Malisiewicz, A. Torralba. ICCV 2013.


"The real voyage of discovery consists not in seeking new landscapes but in having new eyes." — Marcel Proust

This is our MIT paper, which I already blogged about (Can you pass the HOGgles test?), so instead of rehashing what was already mentioned, I'll just leave you with the quote above.  There are lots of great visualizations that Carl Vondrick put together on the HOGgles project webpage, so take a look.


Style-aware Mid-level Representation for Discovering Visual Connections in Space and Time. Yong Jae Lee, Alexei A. Efros, and Martial Hebert, ICCV 2013.


“Learn how to see. Realize that everything connects to everything else.” – Leonardo da Vinci

This paper is all about discovering how visual entities change as a function of time and space.  One great example is how the appearance of cars has changed over the past several decades.  Another example is how typical Google Street View images change as a function of going North-to-South in the United States.  Surely the North looks different than the South -- we now have an algorithm that can automatically discover these precise differences.

By the way, congratulations on the move to Berkeley, Monsieur Efros.  I hope your insatiable thirst for cultured life will not only be satisfied in the city which fostered your intellectual growth, but you will continue to inspire, educate, and motivate the next generation of visionaries.




NEIL: Extracting Visual Knowledge from Web Data. Xinlei Chen, Abhinav Shrivastava and Abhinav Gupta. In ICCV 2013. www.neil-kb.com




Fucking awesome! I don't normally use profanity on my blog, but I couldn't come up with a better phrase to describe the ideas presented in this paper.  A computer program that runs 24/7, collecting visual data from the internet and continually learning what the world is all about.  This is machine learning, this is AI, this is the future.  None of this train-on-my-favourite-dataset, test-on-my-favourite-dataset bullshit.  If there's anybody who's going to do it the right way, it's the CMU gang.  This paper gets my unofficial "Vision Award." Congratulations, Xinlei!



This sort of never-ending learning has been applied to text by Tom Mitchell's group (also from CMU), but this is the first serious attempt at never-ending visual learning.  The underlying algorithm is a semi-supervised learning algorithm which uses Google Image search to bootstrap the initial detectors, but eventually learns object-object relationships, object-attribute relationships, and scene-attribute relationships.
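
In spirit, the loop looks something like the following Python sketch (my own rendering, not the NEIL codebase; every helper here is a trivial stand-in for a web-scale component):

    def google_image_search(concept):      # stand-in: returns seed "images"
        return [f"{concept}_seed_{i}" for i in range(3)]

    def train_detector(examples):          # stand-in: a detector is its train set
        return list(examples)

    def mine_and_verify(detector):         # stand-in: semi-supervised mining step
        return [ex + "_mined" for ex in detector[:1]]

    def extract_relationships(detectors):  # stand-in: co-occurrence statistics
        names = sorted(detectors)
        return [(a, "co-occurs-with", b) for a, b in zip(names, names[1:])]

    def neil_loop(concepts, iterations=2):
        detectors = {c: train_detector(google_image_search(c)) for c in concepts}
        knowledge = []
        for _ in range(iterations):        # in NEIL this loop never terminates
            for c in detectors:
                detectors[c] = train_detector(detectors[c] + mine_and_verify(detectors[c]))
            knowledge.extend(extract_relationships(detectors))
        return knowledge

    print(neil_loop(["car", "wheel"]))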



Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition. J. F. Henriques, J. Carreira, R. Caseiro, J. Batista. ICCV 2013.

Want faster detectors? Tired of hard-negative mining? Love all things Fourier?  Then this paper is for you.  Aren't you now glad you fell in love with linear algebra at a young age? This paper very clearly shows that there is a better way to perform hard-negative mining when the negatives are mined from translations of an underlying image pattern, as is typically done in object detection.  The basic idea is simple, and that's why this paper wins the "thumbs-up from tombone" award. The crux of the derivation is the observation that the Gram matrix of a set of images and their translated versions, as modeled by cyclic shifts, exhibits a block-circulant structure.  Instead of incrementally mining negatives, the authors show that it is possible to learn directly from a training set comprising all image subwindows of a predetermined aspect ratio, and that this is feasible for a rich set of popular models including Ridge Regression, Support Vector Regression (SVR) and Logistic Regression.  Move over hard-negative mining, Joseph Fourier just rocked your world.
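
For the curious, here is a minimal 1-D Python sketch of the underlying Fourier trick in the Ridge Regression case (my own toy illustration, not the authors' code): the data matrix of all cyclic shifts of a signal is circulant, the FFT diagonalizes it, and regression against all n shifted "subwindows" collapses to element-wise operations.

    import numpy as np

    def ridge_over_all_shifts(x, y, lam=1e-2):
        """Solve min_w sum_i (w . shift_i(x) - y_i)^2 + lam ||w||^2,
        where shift_i cycles x by i samples and y holds one label per shift."""
        X_hat = np.fft.fft(x)
        Y_hat = np.fft.fft(y)
        # Element-wise division replaces inverting the n x n Gram matrix.
        W_hat = np.conj(X_hat) * Y_hat / (np.abs(X_hat) ** 2 + lam)
        return np.real(np.fft.ifft(W_hat))

    # Toy usage: a signal with a bump, positive label at shift 0 only.
    n = 64
    x = np.zeros(n); x[10:15] = 1.0
    y = np.zeros(n); y[0] = 1.0          # desired response per cyclic shift
    w = ridge_over_all_shifts(x, y)
    print(w.shape)  # (64,) -- one filter trained against all n shifts at once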

P.S. Joao Carreira also created the CPMC image segmentation algorithm at CVPR 2010.  A recent blog post from Piotr Dollár (December 10th, 2013), "A Seismic Shift in Object Detection" discusses how segmentation is coming back into vision in a big way.


3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding, Scott Satkin and Martial Hebert. ICCV 2013.

A new way of matching images that come equipped with 3D data.  Whether the data comes from Google SketchUp or is the output of a Kinect-like scanner, more and more visual data comes with its own 3D interpretation.  Unfortunately, most state-of-the-art image matching methods rely on comparing purely visual cues.  This paper is based on an idea called "fine-grained geometry refinement" and allows the transfer of information across extreme viewpoint changes.  While still computationally expensive, it allows non-parametric (i.e., data-driven) approaches to get away with using significantly smaller amounts of data.


Modifying the Memorability of Face Photographs.  Aditya Khosla, Wilma A. Bainbridge, Antonio Torralba and Aude Oliva, ICCV 2013.

Ever wanted to look more memorable in your photos?  Maybe your ad campaign could benefit from face pictures which are more likely to stick in people's minds.  Well, now there's an algorithm for that.  Another great MIT paper, in which the authors show that the memorability of photographs can not only be measured, but automatically enhanced!


SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels. J. Xiao, A. Owens and A. Torralba. ICCV 2013. sun3d.cs.princeton.edu



Xiao et al. continue their hard-core data collection efforts, now in 3D.  In addition to collecting a vast dataset of 3D reconstructed scenes, they show that some kinds of reconstruction errors simply cannot be overcome with high-quality solvers.  Some problems are too big and too ambitious (e.g., walking around an entire house with a Kinect) for even the best industrial-grade solvers (such as Google's Ceres solver) to tackle.  In this paper, they show that a small amount of human annotation is all it takes to snap those reconstructions into place.  And not with any sort of crazy click-here, click-there interface: simple LabelMe-like annotation interfaces, which require annotating object polygons, can be used to create additional object-object constraints which help the solvers do their magic.  Anybody interested in long-range scene reconstruction should take a look at their paper.

If there's one person I've ever seen who collects data while the rest of the world sleeps, it is definitely Prof. Xiao.  Congratulations on the new faculty position!  Princeton has been starving for a person like you.  If anybody is looking for PhD/Masters/postdoc positions and wants to work alongside one of the most ambitious and driven upcoming researchers in vision (Prof. Xiao), take a look at his disclaimer/call for students/postdocs at Princeton, then apply to the program directly.  Did I mention that you probably have to be a hacker/scientist badass to land a position in his lab?

Other noteworthy papers:

Mining Multiple Queries for Image Retrieval: On-the-fly learning of an Object-specific Mid-level Representation. B. Fernando, T. Tuytelaars, ICCV 2013.

Training Deformable Part Models with Decorrelated Features. R. Girshick, J. Malik, ICCV 2013.

Sorry if I missed your paper -- there were just too many good ones to list.  For those of you still in Sydney, be sure to either take a picture of a kangaroo or eat one.

Sunday, June 13, 2010

everything is misc -- torralba cvpr paper to check out

Weinberger's Everything is Miscellaneous is a delightful read -- I just finished it today while flying from PIT to SFO.  It was recommended to me by my PhD advisor, Alyosha, and now I can see why!  Many of the key motivations behind my current research on object representation deeply resonate in Weinberger's book.

Weinberger motivates Rosch's theory of categorization (the Prototype Model), and explains how it is a significant break from thousands of years of Aristotelian thought.  Aristotle gave us the notion of a category -- centered around the notion of a definition.  For Aristotle, every object can be stripped to its essential core and placed in its proper place in a God's-eye objective organization of the world.  It was Rosch who showed us that categories are much fuzzier and more hectic than the rigid Aristotelian system suggests. Just as Copernicus single-handedly stopped the Sun and set the Earth in motion, Rosch disintegrated our neatly organized world-view and demonstrated how an individual's path through life shapes his or her concepts.

I think it is fair to say that my own ideas, as well as Weinberger's, are not so much an extension of the Roschian mode of thought as a significant break from the entire category-based way of thinking.  Given that Rosch studied Wittgenstein as a student, I'm surprised her stance wasn't more extreme, more along the anti-category line of thought.  I don't want to undermine her contribution to psychology and computer science in any way; she should only be lauded for her remarkable research.  Perhaps Wittgenstein was as extreme and iconoclastic as I like my philosophers to be, but Rosch provided us with a computational theory and not just a philosophical lecture.

From my limited expertise in theories of categorization in the field of Psychology, whether it is Prototype Models or the more recent data-driven Exemplar Models, these are still theories of categories.  Whether the similarity computations are between prototypes and stimuli, or between exemplars and stimuli, the output of a categorization model is still a category (a toy sketch contrasting the two models appears below).  Weinberger is all about modern data-driven notions of knowledge organization, in a way that breaks free from the imprisoning notion of a category.  Knowledge is power, so why imprison it in rigid modules called categories?  Below is a toy visualization of a web of concepts, as imagined by me.  This is very much the web-based view of the world: Wikipedia is a bunch of pages and links.

Artistic rendition of a "web of concepts"
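
Returning to the Prototype and Exemplar Models mentioned above, here is a toy Python sketch of my own (with made-up 2-D "stimuli") contrasting the two: a Prototype Model summarizes each category by one point, an Exemplar Model lets every stored instance vote, and either way the answer is still a category label.

    import numpy as np

    exemplars = {"bird": np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]),
                 "plane": np.array([[5.0, 5.0], [5.5, 4.5]])}

    def prototype_predict(x):
        # one summary point (the mean) per category
        return min(exemplars, key=lambda c: np.linalg.norm(x - exemplars[c].mean(0)))

    def exemplar_predict(x, beta=1.0):
        def support(c):  # summed similarity to every stored exemplar
            return np.exp(-beta * np.linalg.norm(exemplars[c] - x, axis=1)).sum()
        return max(exemplars, key=support)

    stimulus = np.array([1.1, 2.1])
    print(prototype_predict(stimulus), exemplar_predict(stimulus))  # bird bird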

I found it valuable to think of the Visual Memex, the model I'm developing in my thesis research, as an anti-categorization model of knowledge -- a vast network of object-object relationships.  The idea of using little concrete bits of information to create a rich non-parametric web is the recurring theme in Weinberger's book.  In my case, the problem of extracting primitives from images, and all of the problems of dealing with real-world images, are around to plague me, and the Visual Memex must rely on many Computer Vision techniques -- such things are not discussed in Weinberger's book.  The "perception" or "segmentation" component of the Visual Memex is not trivial, whereas linking words on the web is much easier.
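
To make the "web of concepts" idea concrete, here is a toy Python sketch of my own (not the actual Visual Memex implementation): the nodes are object exemplars, the edges are similarity relationships, and no category label appears anywhere.

    import numpy as np

    rng = np.random.default_rng(1)
    features = rng.normal(size=(10, 8))       # 10 object exemplars, 8-D features

    def build_memex_edges(feats, k=2):
        """Link each exemplar to its k most similar peers."""
        edges = []
        for i, f in enumerate(feats):
            dists = np.linalg.norm(feats - f, axis=1)
            dists[i] = np.inf                 # no self-loops
            for j in np.argsort(dists)[:k]:
                edges.append((i, int(j)))
        return edges

    print(build_memex_edges(features))        # e.g. [(0, 7), (0, 3), (1, ...), ...]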

CVPR paper to look out for

However, the category-based view is all around us.  I expect most of this year's CVPR papers to fit in this category-based view of the world. One paper, co-authored by the great Torralba, looks relevant to my interests.  It is yet another triumph for the category-based mentality in computer vision.  In fact, one of the figures in the paper demonstrates the category-based view of the world very well.  Unlike the memex, the organization is explicit in the following figure:

[Figure from the paper showing its explicit, category-based organization of the visual world]
Exploiting Hierarchical Context on a Large Database of Object Categories. Myung Jin Choi, Joseph Lim, Antonio Torralba, and Alan S. Willsky. CVPR 2010.

Thursday, March 26, 2009

Beyond Categorization: Getting Away From Object Categories in Computer Vision

Natural language evolved over thousands of years to become the powerful tool that it is today. When we use language to convey our experiences with the world, we can't help but refer to object categories. When we say something such as "this is a car," what we are actually saying is "this is an instance from the car category." Categories let us get away from referring to individual object instances -- in most cases, knowing that something belongs to a particular category is more than enough knowledge to deal with it. This is a type of "understanding by compression," or understanding by abstracting away the unnecessary details. In the words of Rosch, "the task of category systems is to provide maximum information with the least cognitive effort." Rosch would probably agree that it only makes sense to talk about the utility of a category system (as a tool for getting a grip on reality) as opposed to the truth value of a category system with respect to how well it aligns with observer-independent reality. The degree of pragmatism expressed by Rosch is something that William James would have been proud of.

From a very young age we are taught language, and soon it takes over our inner world. We 'think' in language. Language provides us with a list of nouns -- a way of cutting up the world into categories. Different cultures have different languages that cut up the world differently, and one might wonder how well the object categories contained in any single language correspond to reality -- if it even makes sense to talk about an observer-independent reality. Rosch would argue that human categorization is the result of "psychological principles of categorization" and is more related to how we interact with the world than to how the world is. If the only substances we ingested for nutrients were types of grass, then categorizing all of the different strains of grass with respect to flavor, vitamin content, color, etc. would be beneficial for us (as a species). Rosch points out in her works that her ideas refer to categorization at the species level, and she calls it human categorization. She is not referring to personal categorization -- for example, the way a child might cluster concepts when he or she starts learning about the world.

It is not at all clear to me whether we should be using the categories from natural language as the to-be-recognized entities in our image understanding systems. Many animals do not have a language with which they can compress percepts into neat little tokens -- yet they have no problem interacting with the world. Of course, if we want to build machines that understand the world around them in a way that they can communicate with us (humans), then language and its inherent categorization will play a crucial role.

While we ultimately use language to convey our ideas to other humans, how early are the principles of categorization applied to perception? Is the grouping of percepts into categories even essential for perception? I doubt that anybody would argue that language and its inherent categorization is not useful for dealing with the world -- the only question is how it interacts with perception.

Most computer vision researchers are stuck in the world of categorization and many systems rely on categorization at a very early stage. A problem with categorization is its inability to deal with novel categories -- something which humans must deal with at a very young age. We (humans) can often deal with arbitrary input and using analogies can still get a grip and the world around us (even when it is full of novel categories). One hypothesis is that at the level of visual perception things do not get recognized into discrete object classes -- but a continuous recognition space. Thus instead of asking the question, "What is this?" we focus on similarity measurements and ask "What is this like?". Such a comparison-based view would help us cope with novel concepts.