Friday, December 27, 2013

Understanding the Visual World, one 3D reconstruction at a time

The first generation of datasets in the computer vision community were just plain old images --  simple arrays of pixels.  Seems like nothing fancy, but we must recall that there was a time when a single image could barely fit inside a computer's memory.  During this early time, researchers showcased their image processing algorithms on the infamous Lenna image.  But later we saw datasets like the Corel dataset, Caltech 101, LabelMe, SUN, James Hays' 6 million Flickr images, PASCAL VOC, and ImageNet.  These were impressive collections of images and for the first time computer vision researchers did not have to collect their own images.  As Jitendra Malik once said, large annotated datasets marked the end of the "Wild Wild West" in Computer Vision -- for the first time, large datasets allowed researchers to compare object recognition algorithms on the same sets of images! What is different about these datasets is that some come with annotations at the image level, some come with annotated polygons, and some come with nothing more than objects annotated at the bounding box level.  Images are captured by a camera and annotations are produced by a human annotation effort.  But these traditional vision datasets lack depth, 3D information, or anything of that sort.  LabelMe3D was an attempt at reconstructing depth from object annotations, but it would only work in a pop-up world kind of way.

The next generation of datasets is all about going into 3D.  But not just annotated depth images like the NYU2 Depth Dataset depicted in the following image:

What a 3D Environment dataset (or 3D place dataset) is all about is making 3D reconstructions the basic primitive of research.  This means that an actual 3D reconstruction algorithm has to be run first to create the dataset.  This is a fairly new idea in the Computer Vision community.  The paper which introduces such a dataset, SUN3D, was presented at this year's ICCV 2013 conference.  I briefly outlined the paper in my ICCV 2013 summary blog post, but I felt that this topic is worthy of its own blog post.  For those interested, the paper link is below:

J. Xiao, A. Owens and A. Torralba. SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels. Proceedings of the 14th IEEE International Conference on Computer Vision (ICCV 2013). paper link

Running a 3D reconstruction algorithm is no easy feat, but Xiao et al. found that some basic polygon-level annotations were sufficient for snapping Structure from Motion algorithms into place.  For those of you who don't know what a Structure from Motion (SfM) algorithm is, it is a process which reconstructs the 3D locations of points inside images (the structure) as well as the camera parameters (the motion) for a sequence of images.  Xiao et al.'s SfM algorithm uses the depth data from a Kinect sensor in addition to the manually provided object annotations. Check out their video below:
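To make the "structure" and "motion" split a bit more concrete, here is a minimal Python sketch of the quantity a typical SfM solver tries to drive down: the reprojection error.  This is only a toy illustration, not Xiao et al.'s pipeline.  The pinhole camera with a single yaw angle, the focal length of 500 pixels, and all function names are assumptions made up for this example.

```python
import math

def rotate_y(point, angle):
    """Rotate a 3D point about the vertical (y) axis."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def project(point, yaw, translation, focal=500.0):
    """Pinhole projection of a world point into a camera with pose (yaw, translation)."""
    x, y, z = rotate_y(point, yaw)
    x, y, z = x + translation[0], y + translation[1], z + translation[2]
    return (focal * x / z, focal * y / z)

def reprojection_error(points, cameras, observations):
    """Sum of squared pixel distances between predicted and observed projections."""
    total = 0.0
    for (cam_idx, pt_idx), (u_obs, v_obs) in observations.items():
        yaw, t = cameras[cam_idx]
        u, v = project(points[pt_idx], yaw, t)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total

# Toy scene: three 3D points (the structure) seen by two cameras (the motion).
points = [(0.0, 0.0, 4.0), (1.0, 0.5, 5.0), (-1.0, -0.5, 6.0)]
cameras = [(0.0, (0.0, 0.0, 0.0)), (0.1, (-0.5, 0.0, 0.0))]

# Synthetic, noise-free observations: where each point lands in each image.
observations = {(c, p): project(points[p], *cameras[c])
                for c in range(len(cameras)) for p in range(len(points))}

# The true structure and motion explain the observations exactly ...
perfect = reprojection_error(points, cameras, observations)

# ... while a slightly wrong structure does not.
perturbed_points = [(x + 0.1, y, z) for x, y, z in points]
worse = reprojection_error(perturbed_points, cameras, observations)
```

A real solver searches over both the points and the camera poses to minimize this error.  Extra information, such as Kinect depth or object annotations, enters as additional terms in the same kind of objective.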

Depth dataset vs SUN 3D dataset
The NYU2 Depth dataset is useful for studying object detection algorithms which operate on 2.5D images, while MIT's SUN 3D dataset is useful for contextual reasoning and object-object relationships.  This is important because Kinect images do not give full 3D; they merely return a 2.5D "depth image" from which certain physical relationships cannot be easily inferred.
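A small sketch of why a depth image is only "2.5D": each pixel stores one depth value, so back-projecting it gives at most one 3D point per pixel, namely the first surface the ray hits.  Anything behind that surface is simply absent.  The intrinsics (fx, fy, cx, cy) and the tiny depth array below are made-up values for illustration, not Kinect calibration data.

```python
def backproject(depth, fx, fy, cx, cy):
    """Turn a depth image (rows of depths in metres, 0 = missing) into 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # the sensor returned no depth for this pixel
            # Standard pinhole back-projection of pixel (u, v) at depth z.
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

# A 2x3 "depth image": a near object (1.0 m) in front of a wall (2.0 m),
# with one missing reading.
depth = [[2.0, 2.0, 0.0],
         [2.0, 1.0, 2.0]]

cloud = backproject(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
```

The resulting cloud has one point per valid pixel and says nothing about the wall behind the near object.  A reconstruction built from many viewpoints is what fills in that missing geometry.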

Startups doing 3D
It is also worthwhile pointing out that Matterport, a new startup, is creating their own sensors and algorithms for helping people create their own 3D reconstructions. Check out their video below:

What this means for the rest of us
We should expect the next generation of smartphones to have their own 3D sensors.  In addition, we should expect the new generation of wearable devices such as Google Glass to give us more than 2D reasoning; they should be able to use this 3D data to make better visual inferences.  I'm glad to see 3D getting more and more popular as this allows researchers to work on new problems, new data structures, and push their creativity to the next level!

Wednesday, December 25, 2013

My kickstarter explained using only the 1000 most used words

A wordle using all words from the VMX Project Kickstarter original text

Describing new technology without sounding intimidating is a daunting task, especially when you have a kickstarter about emerging technology which intersects the fields of artificial intelligence and computer vision!  Above you can see a Wordle (a visual collection of most frequently used words, size proportional to frequency) I made from the original text of my Kickstarter named VMX Project: Computer Vision for Everyone. Below, I summarize the Kickstarter using only the 1000 most used words. To quickly explain: the Up-Goer Five Text Editor (inspired by this XKCD) lets you type some text and makes sure you don't use uncommon words. Here are just a few of the words that are banned from the summary: vision, code, product, item, detect, robot, automate, program, machine, automatic, object, and teach. I took a half hour, decided to use the word "VMX" because it is the product name, and this is the summary I came up with.

VMX is all about "computer seeing." To make a computer recognize things like cars, people, faces, hands, dogs, couches, or bottles is not easy. If these things could be recognized in real-time either from a picture or live, we could make new ways of playing games with computers, play different music/shows when different friends' faces are recognized, or control your computer with your hands.

This type of "computer seeing" problem is very hard to try to answer yourself. I have been studying "computer seeing" for 10 years, and made VMX, which is a new way of getting stuff recognized in pictures by computers. You use VMX when you want something recognized in a picture. It is very fast -- you can add new stuff to be recognized in a few minutes. With VMX, you won't have to know how "computer seeing" is used by a computer to recognize things in pictures, and won't have to read hard books or papers. VMX is made for everyone to enjoy "computer seeing" -- It is easy, saves you lots of time, and can be quite fun. You can make VMX recognize and act on the things in your home, your friends' faces, or even body parts like hands. Because VMX is so easy to use, you spend more time using "computer seeing" in fun ways and we let you easily share your cool ideas with other people, if and when you want to.

If you give us 100$, we'll let you use VMX early, and you'll get more use time (2x to 3x) than people paying later in the year. For 25$, you'll still get more use time (2x to 3x). The more you help, the better deal on use time you get, and get to play with VMX earlier!

Thanks for helping us make VMX come to life! Merry Christmas!

[This post was inspired by the Christmas entry from Scott Aaronson's blog, Shtetl-Optimized, called "Merry Christmas! My quantum computing research explained, using only the 1000 most common English words", and it's a great exercise for unlearning the curse of knowledge!]

Monday, December 23, 2013

VMX: Teach your computer to see without leaving the browser, my Kickstarter project

I’ve spent the last 12 years of my life learning how machines think, and now it is time to give a little something back.  I’m not just talking about using computers, nor writing ordinary computer programs.  I’m talking about Robotics, Artificial Intelligence, Machine Learning, and Computer Vision.  Throughout these 12 years, I’ve witnessed how engineers and scientists pursue these problems, at three great universities: RPI, CMU, and MIT.  I’ve been to 11 research conferences, given many talks, written and co-written many papers, helped teach a few computer vision courses, helped run a few innovation workshops centered around computer vision, and released some open-source computer vision code.

But now, in 2014, most people still struggle with understanding what computer vision is all about and how to get computer vision tools up and running.  I’ve decided that a traditional career in Academia would allow me to motivate no more than a few classrooms of students per year.  A rough estimate of 100 students per year across a 30 year career is a mere 3,000 students.  What about everybody else?  One could argue that some of these students would become educators themselves and the wonderful art of computer vision would reach beyond 3,000.  But I can’t wait.  I don’t want to wait.  Computer vision is too awesome.  I’m too excited.  It's time for everybody to feel this excitement.

So I decided to do something crazy.  Something I wanted to do for a long time, but only recently realized that it would not be possible to do inside the confines of a University.  I recruited the craziest and most bad-ass developer I’ve ever encountered and decided to do the following: convert advanced computer vision technology into a product form that would be so easy to use, a kid without any programming knowledge could train his own object detectors.

I’ve been working non-stop with my colleague and cofounder at our new company to bring you the following Kickstarter campaign:

What if your computer was just a little bit smarter? What if it could understand what is going on in its surroundings merely by looking at the world through a camera? Such technology could be used to make games more engaging, our interactions with computers more seamless, and allow computers to automate many of our daily chores and responsibilities. We believe that new technology shouldn’t be about advanced knobs, long manuals, or require domain expertise. 

The VMX project was designed to bring cutting-edge computer vision technology to a very broad audience: hobbyists, researchers, artists, students, roboticists, engineers, and entrepreneurs. Not only will we educate you about potential uses of computer vision with our very own open-source vision apps, but the VMX project will give you all the tools you need to bring your own creative computer vision projects to life.

VMX gives individuals all they need to effortlessly build their very own computer vision applications. Our technology is built on top of 10+ years of research experience acquired from CMU, MIT, and Google. By leaving the hard stuff to us, you will be able to focus on creative uses of computer vision without the headaches of mastering machine learning algorithms or managing expensive computations. You won’t need to be a C++ guru or know anything about statistical machine learning algorithms to start using laboratory-grade computer vision tools for your own creative uses.

In order to make the barrier-of-entry to computer vision as low as possible, we built VMX directly in the browser and made sure that it requires no extra hardware. All you need is a laptop with a webcam and an internet connection. Because browsers such as Chrome and Firefox can read video directly from a webcam, you most likely have all of the required software and hardware. The only thing missing is VMX.

We're truly excited about what is going to happen next, but we need your help!  Please spread the word, and if you're even mildly excited about computer vision, consider supporting this project.

Thanks Everyone!
Tomasz, @quantombone, author of tombone's computer vision blog

P.S. I'm not telling you what VMX stands for...

Friday, December 06, 2013

Brand Spankin' New Vision Papers from ICCV 2013

The International Conference on Computer Vision, ICCV, gathers the world's best researchers in Computer Vision and Machine Learning to showcase their newest and hottest ideas. (My work on the Exemplar-SVM debuted two years ago at ICCV 2011 in Barcelona.) This year, at ICCV 2013 in Sydney, Australia, the vision community witnessed lots of grand new ideas, excellent presentations, and gained new insights which are likely to influence the direction of vision research in the upcoming decade.

3D data is everywhere.  Detectors are not only getting faster, but getting stylish.  Edges are making a comeback.  HOGgles let you see the world through the eyes of an algorithm. Computers can automatically make your face pictures more memorable. And why ever stop learning, when you can learn all day long?

Here is a breakdown of some of the must-read ICCV 2013 papers which I'd like to share with you:

From Large Scale Image Categorization to Entry-Level Categories. Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L. Berg, ICCV 2013.

This paper is the Marr Prize winning paper from this year's conference.  It is all about entry-level categories - the labels people will use to name an object - which were originally defined and studied by psychologists in the 1980s. In the ICCV paper, the authors study entry-level categories at a large scale and learn the first models for predicting entry-level categories for images. The authors learn mappings between concepts predicted by existing visual recognition systems and entry-level concepts that could be useful for improving human-focused applications such as natural language image description or retrieval. NOTE: If you haven't read Eleanor Rosch's seminal 1978 paper, The Principles of Categorization, do yourself a favor: grab a tall coffee, read it and prepare to be rocked.

Structured Forests for Fast Edge Detection, P. Dollar and C. L. Zitnick, ICCV 2013.

This paper from Microsoft Research is all about pushing the boundaries for edge detection. Randomized Decision Trees and Forests have been used in lots of excellent Microsoft research papers, with Jamie Shotton's Kinect work being one of the best examples, and they are now being used for super high-speed edge detection.  However, this paper is not just about edges.  Quoting the authors, "We describe a general purpose method for learning structured random decision forest that robustly uses structured labels to select splits in the trees."  Anybody serious about learning for low-level vision should take a look.

There is also some code available, but take a very detailed look at the license before you use it in your project.  It is not an MIT license.

HOGgles: Visualizing Object Detection Features, C. Vondrick, A. Khosla, T. Malisiewicz, A. Torralba. ICCV 2013.

"The real voyage of discovery consists not in seeking new landscapes but in having new eyes." — Marcel Proust

This is our MIT paper, which I already blogged about (Can you pass the HOGgles test?), so instead of rehashing what was already mentioned, I'll just leave you with the quote above.  There are lots of great visualizations that Carl Vondrick put together on the HOGgles project webpage, so take a look.

Style-aware Mid-level Representation for Discovering Visual Connections in Space and Time. Yong Jae Lee, Alexei A. Efros, and Martial Hebert, ICCV 2013.

“Learn how to see. Realize that everything connects to everything else.” – Leonardo da Vinci

This paper is all about discovering how visual entities change as a function of time and space.  One great example is how the appearance of cars has changed over the past several decades.  Another example is how typical Google Street View images change as a function of going North-to-South in the United States.  Surely the North looks different than the South -- we now have an algorithm that can automatically discover these precise differences.

By the way, congratulations on the move to Berkeley, Monsieur Efros.  I hope your insatiable thirst for cultured life will not only be satisfied in the city which fostered your intellectual growth, but you will continue to inspire, educate, and motivate the next generation of visionaries.

NEIL: Extracting Visual Knowledge from Web Data. Xinlei Chen, Abhinav Shrivastava and Abhinav Gupta. In ICCV 2013.

Fucking awesome! I don't normally use profanity in my blog, but I couldn't come up with a better phrase to describe the ideas presented in this paper.  A computer program which runs 24/7 to collect visual data from the internet and continually learn what the world is all about.  This is machine learning, this is AI, this is the future.  None of this train on my favourite dataset, test on my favourite dataset bullshit.  If there's anybody that's going to do it the right way, it's the CMU gang.  This paper gets my unofficial "Vision Award." Congratulations, Xinlei!

This sort of never-ending learning has been applied to text by Tom Mitchell's group (also from CMU), but this is the first serious attempt at never-ending visual learning.  The underlying algorithm is a semi-supervised learning algorithm which uses Google Image search to bootstrap the initial detectors, but eventually learns object-object relationships, object-attribute relationships, and scene-attribute relationships.

Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition. J. F. Henriques, J. Carreira, R. Caseiro, J. Batista. ICCV 2013.

Want faster detectors? Tired of hard-negative mining? Love all things Fourier?  Then this paper is for you.  Aren't you now glad you fell in love with linear algebra at a young age? This paper very clearly shows that there is a better way to perform hard-negative mining when the negatives are mined from translations of an underlying image pattern, as is typically done in object detection.  The basic idea is simple, and that's why this paper wins the "thumbs-up from tombone" award. The crux of the derivation in the paper is the observation that the Gram matrix of a set of images and their translated versions, as modeled by cyclic shifts, exhibits a block-circulant structure.  Instead of incrementally mining negatives, in this paper they show that it is possible to learn directly from a training set comprising all image subwindows of a predetermined aspect-ratio and show this is feasible for a rich set of popular models including Ridge Regression, Support Vector Regression (SVR) and Logistic Regression.  Move over hard-negative mining, Joseph Fourier just rocked your world.
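The circulant observation is easy to check numerically.  The following is a 1D toy sketch only, not the authors' algorithm (which handles image patches, multiple feature channels, and block structure): it builds every cyclic shift of a short made-up signal, forms the Gram matrix of the shifts, and confirms that the matrix is circulant and that its eigenvalues come straight from the Fourier transform of the signal.  The signal values and helper names are invented for this example.

```python
import cmath

def cyclic_shifts(x):
    """All cyclic shifts of x: row k is x shifted right by k positions."""
    n = len(x)
    return [[x[(i - k) % n] for i in range(n)] for k in range(n)]

def gram(rows):
    """Gram matrix of inner products between every pair of rows."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]

def dft(x):
    """Naive discrete Fourier transform (fine for tiny inputs)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

x = [1.0, 2.0, 0.0, -1.0, 3.0]   # a made-up 1D "image pattern"
n = len(x)
G = gram(cyclic_shifts(x))

# Circulant structure: entry (j, k) depends only on (k - j) mod n,
# so the whole matrix is determined by its first row.
is_circulant = all(abs(G[j][k] - G[0][(k - j) % n]) < 1e-9
                   for j in range(n) for k in range(n))

# A circulant matrix is diagonalized by the DFT, so its eigenvalues are the
# DFT of its first row.  Here they equal the power spectrum |DFT(x)|^2.
eig_from_row = dft(G[0])
power_spectrum = [abs(c) ** 2 for c in dft(x)]
```

Because the Fourier basis diagonalizes the Gram matrix, a solve that would normally cost a matrix inversion over all shifted samples reduces to independent per-frequency operations.  That is the intuition behind learning from all translations at once instead of mining them incrementally.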

P.S. Joao Carreira also created the CPMC image segmentation algorithm at CVPR 2010.  A recent blog post from Piotr Dollár (December 10th, 2013), "A Seismic Shift in Object Detection" discusses how segmentation is coming back into vision in a big way.

3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding, Scott Satkin and Martial Hebert. ICCV 2013.

A new way of matching images that come equipped with 3D data.  Whether the data comes from Google Sketchup, or is the output of a Kinect-like scanner, more and more visual data comes with its own 3D interpretation.  Unfortunately, most state-of-the-art image matching methods rely on comparing purely visual cues.  This paper is based on an idea called "fine-grained geometry refinement" and allows the transfer of information across extreme viewpoint changes.  While still computationally expensive, it allows non-parametric (i.e., data-driven) approaches to get away with using significantly smaller amounts of data.

Modifying the Memorability of Face Photographs.  Aditya Khosla, Wilma A. Bainbridge, Antonio Torralba and Aude Oliva, ICCV 2013.

Ever wanted to look more memorable in your photos?  Maybe your ad-campaign could benefit from better face pictures which are more likely to stick in people's minds.  Well, now there's an algorithm for that.  Another great MIT paper, in which the authors show that the memorability of photographs can not only be measured, but automatically enhanced!

SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels. J. Xiao, A. Owens and A. Torralba. ICCV 2013.

Xiao et al. continue their hard-core data collection efforts.  Now in 3D.  In addition to collecting a vast dataset of 3D reconstructed scenes, they show that there are some kinds of errors that simply cannot be overcome with high-quality solvers.  Some problems are too big and too ambitious (e.g., walking around an entire house with a Kinect) for even the best industrial-grade solvers (Google's Ceres solver) to tackle.  In this paper, they show that a small amount of human annotation is all it takes to snap those reconstructions in place.  And not any sort of crazy, click-here, click-there interfaces.  Simple LabelMe-like annotation interfaces, which require annotating object polygons, can be used to create additional object-object constraints which help the solvers do their magic.  For anybody interested in long-range scene reconstruction, take a look at their paper.

If there's one person I've ever seen that collects data while the rest of the world sleeps, it is definitely Prof. Xiao.  Congratulations on the new faculty position!  Princeton has been starving for a person like you.  If anybody is looking for PhD/Masters/postdoc positions, and wants to work alongside one of the most ambitious and driven upcoming researchers in vision (Prof. Xiao), take a look at his disclaimer/call for students/postdocs at Princeton, then apply to the program directly.  Did I mention that you probably have to be a hacker/scientist badass to land a position in his lab?

Other noteworthy papers:

Mining Multiple Queries for Image Retrieval: On-the-fly learning of an Object-specific Mid-level Representation. B. Fernando, T. Tuytelaars,  ICCV 2013.

Training Deformable Part Models with Decorrelated Features. R. Girshick, J. Malik, ICCV 2013.

Sorry if I missed your paper, there were just too many good ones to list.  For those of you still in Sydney, be sure to either take a picture of a Kangaroo, or eat one.

Sunday, December 01, 2013

What my mother taught me about computer vision

“Wake up, Tomek.  Pack your bags.  We’re moving to America.” 

These were the words my mother whispered into my ear as she roused me from a deep sleep.  There was no alarm clock and no preparation (at least not on my part). I was eight years old, and it was a typical January morning in Poland.  It was 1992, and besides a brief venture into Czechoslovakia a few years earlier, I had never left Poland before.

I can still remember those words like they were uttered yesterday.  I remember both the comfort of a child being woken up by the reassuring words of one’s mother as well as the excitement of what those words meant.  It was a matter of hours until I would experience my first international flight, my first multi-lane highway, my first supermarket, and get my first dose of American television.

What I learned from my mother is that sometimes, you just have to pack your bags and go.  That is the lesson my mother taught me, and it wasn’t delivered in the form of a university lecture.  It was an action.  An action that would be the single most influential event in my life.  Moving to the Land of Opportunity from Poland was something you could not help but be excited about.

There is a certain kind of excitement that occurs when you make such a bold move in your life.  It requires a certain kind of courage, a certain kind of entrepreneurial spirit.  A certain vision for the future and a certain willingness to take a calculated risk.  A vision that might be filled with uncertainty, but when the uncertainty is drowned by hope, any residual fear just melts away.

My mother never taught me anything about quantum mechanics.  She never provided me with extra tutors that would one day help me get into a good college, no guidance on how to get into a great PhD program, no etiquette lessons on how to become a respected scientist, etc.  But she gave me the courage and confidence to know that if you want something in life and you have the willingness to pursue it, you can get it. The courage that my mother's actions instilled in me has been more influential in my personal development than any single formal source of knowledge so far.  Thanks mom.

Computer vision is all about the future.  It is all about risks.  It requires a certain entrepreneurial spirit that cannot be attained within the comfy confines of the ivory tower.  I see a world where the way we interact with machines is drastically different than today.  I see a future where we are no longer slaves to our smartphones, where automation will allow us to embrace our human side.  A future where technology will allow us to be free from the worries and stresses which saturate contemporary life.  Computer vision is the interface of the future.  It will allow for both machines to make sense of the world around them, and for us to interact with these machines in a much more intuitive way.

But this sort of change cannot happen without a change in attitude.  As of December 2013, computer vision is simply too academic.  Too much mathematics simply for the sake of mathematics.  Too much emphasis on advancing the state-of-the-art by writing esoteric papers and competing on silly benchmarks.  As a community we have made tremendous advancements, but we have to take more risks.  We have to let go of our egos, and stop worrying about our individual resumes.

I no longer believe that the sort of change I want to see in the world is going to happen by itself.  I want computer vision to revolutionize the way we interact with computers.  I believe in Computer Vision the same way I believed (and still do) in America. Computer vision is the technology of the future; it is the technology of opportunity.  But this cannot happen as long as I continue to portray myself as solely an academic figure.  I know that the way I’m approaching life now is much riskier than getting a traditional job/career in the sciences.  It’s strange to admit that my last day at MIT has been much more exciting for me than my first day at MIT.  I am excited.  My fledgling team is excited.  After our product launch, we’re hoping you will participate in our excitement.  I think the fun times are only beginning. The only limits we have are the ones we impose upon ourselves. 

“Wake up computer vision.  Pack your bags.  You’re moving into everyone’s home.” 

Monday, October 28, 2013

Just Add Vision: Turning Computers Into Robots

The future of technology is all about improving the human experience. And the human experience is all about you -- you filling your life with less tedious work, more fun, less discomfort, and more meaningful human interactions. Whether new technology will let us enjoy life more during our spare time (think of what big screen TVs did for entertainment), or, let us become more productive at work (think of what calculators did for engineers), successful technologies have the tendency to improve our quality of life. 

Let’s take a quick look at how things got started... 

IBM started a chain of events by building affordable computers for small businesses to increase their productivity. Microsoft and Apple then created easy-to-use operating systems which allowed the common man to use computers at home for both entertainment (computer games) and being more productive (MS Office). Once personal computers started entering our homes, it was only a matter of years until broadband internet access became widespread. Google then came along and changed the way we retrieve information from the internet while social networking redefined how we interact with the people in our lives. Let's not forget modern smartphones, which let us use all of this amazing technology while on the go! 

Surely our iPhones will get faster and smaller while Google search will become more robust, but does the way we interact with these devices have to stay the same? And will these devices always do the same things? 

Computers without keyboards 
A lot of the world’s most exciting technology is designed to be used directly by people and ceases to provide much value once we stop directly interacting with our devices. I honestly believe that instead of wearing more computing devices (such as Google Glass) and learning new iOS commands, what we need is technology that can do useful things on its own, without requiring a person to hit buttons or custom keyboards. Because doing useful things entails having some sort of computational unit inside, it is fair to think of these future devices as “computers.” However, making computers do useful things on their own requires making machines intelligent, something which is yet to reach the masses, so I think a better name for these devices is robots. 

What is a robot? 
If we want machines to help us out in our daily tasks (e.g., cleaning, cooking, driving, playing with us, teaching us) we need machines that can both perceive their immediate environment and act intelligently. The perception-and-action loop is all that is necessary in order to turn everyday computers into intelligent robots. While it would be “nice” to build humanoid robots which look like this: 

In my opinion, a robot is any device capable of executing its own perception and action loop. Thus, it is not necessary to have full-fledged humanoid robots to start reaping the benefits of consumer in-home robotics. Once we stop looking for smart machines with legs, and broaden our definition of a robot, it is easy to tell that the revolution has already begun. 

Current desktop computers and laptops, which require input in the form of a key being pressed or a movement on the trackpad, can be viewed as semi-intelligent machines -- but because the input interfaces render the perception problem unnecessary, I do not consider them full-fledged robots. However, an iPhone running Siri is capable of sending a text message to one of our contacts via speech, so to some extent I consider Siri-enabled iPhones as robots. Tasks such as cleaning cannot be easily automated using Siri because no matter how dirty a floor is, it will never exclaim, “I’m dirty, please clean me!”. What we need is the ability for our devices to see -- namely, recognize objects in the environment (is this a sofa or a chair?), infer their state (clean vs. dirty), and track their spatial extent in the environment (these pixels belong to the plate). 

Just add vision
We have spent decades using keyboards and mice, essentially learning a machine-specific language between us and machines. Whether you consider keystrokes as a high-level or low-level language is beside the point -- it is still a language, and more specifically a language which requires inputting everything explicitly. If we want machines to effortlessly interact with the world, we need to teach them our language and let them perceive the world directly. With the current advancements in computer vision, this is becoming a reality. But the world needs more visionary thinkers to become computer vision experts, more vision experts to start caring about broader uses of their technology, more everyday programmers to use computer vision in their projects, and more expert-grade computer vision tools accessible to those just starting out. Only then will we be able to pool our collective efforts and finally interweave in-home robotics with the everyday human experience. 

What's next?
Wouldn’t it be great if we had a general-purpose machine vision API which would render the most tedious and time-consuming part of training object detectors obsolete? Wouldn't it be awesome if we could all use computer vision without becoming mathematics gurus or having years of software engineering experience?  Well, this might be happening sooner than you think.  In an upcoming blog post, I will describe what this API is going to look like and why it’s going to make your life a whole lot easier.  I promise not to disappoint...

Friday, September 27, 2013

Teaching Computer Vision Innovation

Tinker. Reason. Experiment. Innovate.

This past August I was involved with the MIT Skoltech Innovation Workshop, the second edition of the successful 2012 workshop.  The motto of the workshop was "Tinker. Reason. Experiment. Innovate." and the goal was to get students accustomed to thinking about innovation in an entirely new way.

“Innovating requires behavioral change. This year students have taught us that we can reliably reproduce the experience we created last year, to effect the change in behavior needed to adopt innovation as a way of thinking,” said Dr. Perez-Breva, PhD, who conceived the workshop and directed both editions.

This was the second time I helped out with the workshop, and just like last year I played the role of a technical advisor for several computer vision-focused teams.  As an aspiring computer vision entrepreneur, the experience of instilling each team with expert computer vision and machine learning knowledge was invaluable.  This gave me an opportunity to better understand how a younger generation of students thinks about computer vision as well as gauge their ability to get technology up and running.  Overall I was impressed with student progress over the short final project period and I learned several important lessons which I want to share with you today.

1.) There is a high barrier to entry when it comes to using state-of-the-art object detectors.  Many great tools have been produced by the research community and work quite well on standard datasets straight out of the box.  Unfortunately, true innovation requires utilizing computer vision techniques in novel scenarios -- scenarios for which new datasets must be created.  The entire process of creating an object dataset, labeling images, and preprocessing is not as straightforward as you might think.  This means that students lost valuable time just on creating the right input for their object detection systems.  Training object detectors should be easier.  Because iteration is inevitable in most successful projects, we need a faster way of building vision-enabled apps.  I wish there were an interactive and real-time way of training object detectors.

2.) Innovation is all about seeing opportunities where others either see problems or nothing at all.  As a technology advisor I had to remind the students that their goal was to think about products which could change the world in a positive way.  Once a team devised a product for a specific market, I would help with the technology by acting as a hired expert.  Only after the teams were able to generate ideas would I discuss the feasibility of their proposed solution and help them brainstorm ways of improving the underlying technology. This sort of open-ended thinking is generally not taught in a classroom -- where the goal is to work on predefined problems.  As an innovator, you have to come up with the problem, the solution, and convince a broader audience that your solution is novel and likely to succeed.  Homework, with its predefined trajectory for success, is the antithesis of innovation. Too much homework and students get accustomed to being assigned defined tasks.  What this means is that to engender a new generation of entrepreneurs we have to give them fewer predefined problem sets, and more open-ended team-based projects.

3.) The project teams responded very well to my enthusiasm and positive reinforcement.  I enjoy working with teams, but in all of my school projects I have always taken the role of lead engineer.  The innovation workshop was my opportunity to work as a coach. I found that I love working with ambitious and talented teams, as they have not yet been perverted by the overly-refined tastes of academics, but are filled with their own dreams of changing the world.  During a one-week intense development period, I rationed my time to 30 minutes per team per day.  I did not answer questions over email, and used a timer to make sure each team only got 30 minutes.  I told each team that I could help them with any technical issue they had -- ranging from C++ linking errors to brainstorming about a new machine learning algorithm on the board -- but that it was up to them to decide how to best make use of my time.  This helped each team better utilize a technical advisor's time, as such advising sessions are quite expensive in the real world (i.e., in startups).  The teams treated me with a high degree of professionalism, and they were quick to realize on their own that it was not worth using me for low-level coding questions.  It doesn't take much to get me excited about computer vision, and I feel that after most of the advising sessions I was able to significantly raise team spirit.  When teaching or advising, it is important to have the students feel positive about their work after each meeting.  For every discouraging word, it is wise to sprinkle in encouragement and an overall positive attitude towards student progress.  If you're going to teach, provide enough positive encouragement so your students leave your office so energized that they are dying to get back to their research projects.  I have found this teaching innovation experience to be more fulfilling than any of my other teaching experiences to date.

Overall, teaching innovation has helped me realize what is missing in the world of PhD Academic research.  There is a big difference between pure research and true innovation, and while those skills are both instrumental in startup success, there is a big difference between being a die-hard researcher and a die-hard entrepreneur.  I now know what my education has been missing. Thank you MIT SkTech Innovation Workshop for the wonderful experience, and for helping me refine certain skills which will likely make me more valuable during my own entrepreneurial ventures.


Wednesday, July 03, 2013

[CVPR 2013] Three Trending Computer Vision Research Areas

As I walked through the large poster-filled hall at CVPR 2013, I asked myself, “Quo vadis Computer Vision?” (Where are you going, computer vision?)  I see lots of papers which exploit last year’s ideas, copious amounts of incremental research, and an overabundance of off-the-shelf computational techniques being recombined in seemingly novel ways.  When you are active in computer vision research for several years, it is not rare to find yourself becoming bored by a significant fraction of papers at research conferences.  Right after the main CVPR conference, I felt mentally drained and needed to get a breath of fresh air, so I spent several days checking out the sights in Oregon.  Here is one picture -- proof that CVPR 2013 had more to offer than ideas!

When I returned from sight-seeing, I took a more circumspect look at the field of computer vision.  I immediately noticed that vision research is actually advancing and growing in a healthy way.  (Unfortunately, most junior students have a hard time determining which research papers are actually novel and/or significant.)  A handful of new research themes arise each year, and today I’d like to briefly discuss three new computer vision research themes which are likely to rise in popularity in the foreseeable future (2-5 years).

1) RGB-D input data is trending.  

Many of this year’s papers take a single 2.5D RGB-D image as input and try to parse the image into its constituent objects.  The number of papers doing this with RGB-D data is seemingly infinite.  Some other CVPR 2013 approaches don’t try to parse the image, but instead do something else like: fit cuboids, reason about affordances in 3D, or reason about illumination.  The reason why such inputs are becoming more popular is simple: RGB-D images can be obtained via cheap and readily available sensors such as Microsoft’s Kinect.  Depth measurements used to be obtained by expensive time-of-flight sensors (in the late 90s and early 00s), but as of 2013, $150 can buy you one of these depth sensing bad-boys!  In fact, I had bought a Kinect just because I thought that it might come in handy one day -- and since I’ve joined MIT, I’ve been delving into the RGB-D reconstruction domain on my own.  It is just a matter of time until the newest iPhone has an on-board depth sensor, so the current line of research which relies on RGB-D input is likely to become the norm within a few years.
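To make the "2.5D" idea concrete, here is a minimal sketch of how a single depth image can be back-projected into a 3D point cloud with a pinhole camera model.  The function name depth_to_points and the intrinsics (fx, fy, cx, cy) are illustrative assumptions, not values from any of the papers above; a real sensor needs its own calibration.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an HxW depth image (in meters) into an (N, 3) point cloud.

    Assumes a pinhole camera; pixels with depth <= 0 are treated as missing.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # column and row index of each pixel
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Toy input: a flat wall 2 m away with one missing depth reading.
depth = np.full((480, 640), 2.0)
depth[0, 0] = 0.0
# Placeholder intrinsics (assumed, not calibrated).
points = depth_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```

Each valid pixel becomes one 3D point, which is why a single RGB-D frame only describes the surfaces visible from one viewpoint.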

2) Mid-level patch discovery is a hot research topic.
Saurabh Singh from CMU introduced this idea in his seminal ECCV 2012 paper, and Carl Doersch applied this idea to large-scale Google Street-View imagery in the “What makes Paris look like Paris?” SIGGRAPH 2012 paper.  The idea is to automatically extract mid-level patches (which could be objects, object parts, or just chunks of stuff) from images with the constraint that those are the most informative patches.  Regarding the SIGGRAPH paper, see the video below.

Unsupervised Discovery of Mid-Level Discriminative Patches Saurabh Singh, Abhinav Gupta, Alexei A. Efros. In ECCV, 2012.

Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei A. Efros. What Makes Paris Look like Paris? In SIGGRAPH 2012. [pdf]

At CVPR 2013, it was evident that the idea of "learning mid-level parts for scenes" is being pursued by other top-tier computer vision research groups.  Here are some CVPR 2013 papers which capitalize on this idea:

Blocks that Shout: Distinctive Parts for Scene Classification. Mayank Juneja, Andrea Vedaldi, CV Jawahar, Andrew Zisserman. In CVPR, 2013. [pdf]

Representing Videos using Mid-level Discriminative Patches. Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry Davis. CVPR, 2013. [pdf]

Part Discovery from Partial Correspondence. Subhransu Maji, Gregory Shakhnarovich. In CVPR, 2013. [pdf]

3) Deep-learning and feature learning are on the rise within the Computer Vision community.
It seems that everybody at Google Research is working on Deep-learning.  Will it solve all vision problems?  Is it the one computational ring to rule them all?  Personally, I doubt it, but the rising presence of deep learning is forcing every researcher to brush up on their l33t backprop skillz.  In other words, if you don't know who Geoff Hinton is, then you are in trouble.

Wednesday, June 26, 2013

[Awesome@CVPR2013] Scene-SIRFs, Sketch Tokens, Detecting 100,000 object classes, and more

I promised to blog about some more exciting papers at CVPR 2013, so here is a short list of a few papers which stood out.  This list also includes this year's award-winning paper: Fast, Accurate Detection of 100,000 Object Classes on a Single Machine.  Congrats Google Research on the excellent paper!

This paper uses ideas from Abhinav Gupta's work on 3D scene understanding as well as Ali Farhadi's work on visual phrases; however, it also uses RGB-D input data (like many other CVPR 2013 papers).

W. Choi, Y. -W. Chao, C. Pantofaru, S. Savarese. "Understanding Indoor Scenes Using 3D Geometric Phrases" in CVPR, 2013. [pdf]

This paper uses the crowd to learn which parts of birds are useful for fine-grained categorization.  If you work on fine-grained categorization or run experiments with MTurk, then you gotta check this out!
Fine-Grained Crowdsourcing for Fine-Grained Recognition. Jia Deng, Jonathan Krause, Li Fei-Fei. CVPR, 2013. [ pdf ]

This paper won the best paper award.  Congrats Google Research!

Fast, Accurate Detection of 100,000 Object Classes on a Single Machine. Thomas Dean, Mark Ruzon, Mark Segal, Jon Shlens, Sudheendra Vijayanarasimhan, Jay Yagnik. CVPR, 2013 [pdf]

The following is the Scene-SIRFs paper, which I thought was one of the best papers at this year's CVPR.  The idea is to decompose an input image into intrinsic images using Barron's algorithm, which was initially shown to work on objects but is now being applied to realistic scenes.

Intrinsic Scene Properties from a Single RGB-D Image. Jonathan T. Barron, Jitendra Malik. CVPR, 2013 [pdf]

This is a graph-based localization paper which uses a sort of "Visual Memex" to solve the problem.
Graph-Based Discriminative Learning for Location Recognition. Song Cao, Noah Snavely. CVPR, 2013. [pdf]

This paper provides an exciting new way of localizing contours in images which is orders of magnitude faster than the gPb.  There is code available, so the impact is likely to be high.

Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection. Joseph J. Lim, C. Lawrence Zitnick, and Piotr Dollar. CVPR 2013. [ pdf ] [code@github]

Friday, June 21, 2013

[Awesome@CVPR2013] Image Parsing with Regions and Per-Exemplar Detectors

I've been making an inventory of all the awesome papers at this year's CVPR 2013 conference, and one which clearly stood out was Tighe & Lazebnik's paper titled:

This paper combines ideas from segmentation-based "scene parsing" (see the video below for the output of their older ECCV2010 SuperParsing system) as well as per-exemplar detectors (see my Exemplar-SVM paper, as well as my older Recognition by Association paper).  I have worked and published in these two separate lines of research, so when I tell you that this paper is worthy of reading, you should at least take a look.  Below I outline the two ideas which are being synthesized in this paper, but for all details you should read their paper (PDF link).  See the overview figure below:

Idea #1: "Segmentation-driven" Image Parsing
The idea of using bottom-up segmentation to parse scenes is not new.  Superpixels (very small segments which are likely to contain a single object category) coupled with some machine learning can be used to produce a coherent scene parsing system; however, the boundaries of objects are not as precise as one would expect.  This shortcoming stems from the smoothing terms used in random field inference and because generic category-level classifiers have a hard time reasoning about the extent of an object.  To see how superpixel-based scene parsing works, check out the video from their older paper from ECCV2010:

Idea #2: Per-exemplar segmentation mask transfer
For me, the most exciting thing about this paper is the integration of the segmentation mask transfer from exemplar-based detections.  The idea is quite simple: each detector is exemplar-specific and is thus equipped with its own (precise) segmentation mask.  When you produce detections from such exemplar-based systems, you can immediately transfer segmentations in a purely top-down manner.  This is what I have been trying to get people excited about for years!  Congratulations to Joseph Tighe for incorporating these ideas into a full-blown image interpretation system.  To see an example of mask transfer, check out the figure below.

Their system produces a per-pixel labeling of the input image, and as you can see below, the results are quite good.  Here are some more outputs of their system as compared to solely region-based as well as solely detector-based systems.  Using per-exemplar detectors clearly complements superpixel-based "segmentation-driven" approaches.
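The top-down mask transfer from Idea #2 can be sketched in a few lines.  This is only an illustration under simplifying assumptions, not the system from the paper: the helper names (resize_nearest, transfer_mask) are hypothetical, the exemplar mask is a boolean array, the detection box is given as (x0, y0, x1, y1) with exclusive right/bottom edges, and the mask is simply pasted into a label map rather than being combined with region-based scores.

```python
import numpy as np

def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2D array to (out_h, out_w)."""
    rows = np.arange(out_h) * mask.shape[0] // out_h
    cols = np.arange(out_w) * mask.shape[1] // out_w
    return mask[rows[:, None], cols[None, :]]

def transfer_mask(label_map, exemplar_mask, box, class_id):
    """Warp an exemplar's boolean mask into a detection box and write class_id there.

    Modifies label_map in place and also returns it.
    """
    x0, y0, x1, y1 = box
    h, w = label_map.shape
    # Clip the box to the image so partially visible detections still work.
    cx0, cy0, cx1, cy1 = max(x0, 0), max(y0, 0), min(x1, w), min(y1, h)
    if cx0 >= cx1 or cy0 >= cy1:
        return label_map  # box is empty or entirely outside the image
    warped = resize_nearest(exemplar_mask, y1 - y0, x1 - x0)
    visible = warped[cy0 - y0:cy1 - y0, cx0 - x0:cx1 - x0]
    region = label_map[cy0:cy1, cx0:cx1]  # a view, so writes reach label_map
    region[visible] = class_id
    return label_map

# Toy example: a 2x2 exemplar mask stretched into a 4x4 detection box.
label_map = np.zeros((6, 8), dtype=int)
exemplar_mask = np.array([[True, False],
                          [True, True]])
transfer_mask(label_map, exemplar_mask, box=(2, 1, 6, 5), class_id=3)
```

Because the mask travels with the exemplar, the object boundary comes from a specific training instance instead of from a generic category-level classifier.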

This paper will be presented as an oral in the Orals 3C session called "Context and Scenes" to be held on Thursday, June 27th at CVPR 2013 in Portland, Oregon.

Tuesday, June 18, 2013

Must-see Workshops @ CVPR 2013

June is that wonderful month during which computer vision researchers, students, and entrepreneurs go to CVPR -- the premier yearly Computer Vision conference.  Whether you are presenting a paper, learning about computer vision, networking with academic colleagues, looking for rock-star vision experts to join your start-up, or looking for rock-star vision start-ups to join, CVPR is where all of the action happens!  If you're not planning on going, it is not too late! The Conference starts next week in Portland, Oregon.

There are lots of cool papers at CVPR, many of which I have already studied in great detail, and many others which I will learn about next week.  I will write about some of the cool papers/ideas I encounter while I'm at CVPR next week.  In addition to the main conference, CVPR has 3 action-packed workshop days.  I want to take this time to mention two super-cool workshops which are worth checking out during CVPR 2013.  Workshop talks are generally better than the main conference talks, since the invited speakers tend to be more senior and they get to present a broader view of their research (compared to the content of a single 8-page research paper as is typically discussed during the main conference).

SUNw: Scene Understanding Workshop
Sunday June 23, 2013

From the webpage: Scene understanding started with the goal of building machines that can see like humans to infer general principles and current situations from imagery, but it has become much broader than that. Applications such as image search engines, autonomous driving, computational photography, vision for graphics, human machine interaction, were unanticipated and other applications keep arising as scene understanding technology develops. As a core problem of high level computer vision, while it has enjoyed some great success in the past 50 years, a lot more is required to reach a complete understanding of visual scenes.

I attended some of the other SUN workshops which were held at MIT during the winter months.  This time around, the workshop is at CVPR, so by definition it will be accessible to more researchers.  Even though I have the pleasure of knowing personally the super-smart workshop organizers (Jianxiong Xiao, Aditya Khosla, James Hays, and Derek Hoiem), the most exciting tidbit about this workshop is the all-star invited speaker schedule.  The speakers include: Ali Farhadi, Yann LeCun, Fei-Fei Li, Aude Oliva, Deva Ramanan, Silvio Savarese, Song-Chun Zhu, and Larry Zitnick.  To hear some great talks and hear about truly bleeding-edge research by some of vision's most talented researchers, come to SUNw.

VIEW 2013: Vision Industry and Entrepreneur Workshop
Monday, June 24, 2013

From the webpage: Once largely an academic discipline, computer vision today is also a commercial force. Startups and global corporations are building businesses based on computer vision technology. These businesses provide computer vision based solutions for the needs of consumers, enterprises in many commercial sectors, non-profits, and governments. The demand for computer vision based solutions is also driving commercial and open-source development in associated areas, including hardware and software platforms for embedded and cloud deployments, new camera designs, new sensing technologies, and compelling applications. Last year, we introduced the IEEE Vision Industry and Entrepreneur Workshop (VIEW) at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) to bridge the academic and commercial worlds of computer vision. 

I include this workshop in the must-see list because the time is right for Computer Vision researchers to start innovating at start-ups.  First of all, the world wants your vision-based creations today.  With the availability of smart phones and widespread broadband access, the world does not want to wait a decade until the full academic research pipeline gets translated into products.  Seeing such workshops at CVPR is exciting, because this will help breed a new generation of researcher/entrepreneur.  I, for one, welcome our new company-starting computer vision overlords.

Sunday, April 21, 2013

International Conference on Computational Photography 2013 (ICCP 2013) Day 1 recap

Yesterday was the first day of ICCP 2013.  While the conference should have started Friday, it was postponed until Saturday due to the craziness in Boston.  Nevertheless, it was an excellent day of mingling with colleagues and listening to talks/posters.  Here are some noteworthy items from Saturday:

Marc Levoy (from Stanford and Google) gave a keynote about Google Glass and how it will change the game of photography and video collection.  Marc was one of three Googlers wearing Glass.  The other two were Sam Hasinoff (former MIT CSAILer) and Peyman Milanfar (from UCSC/Google).  I had the privilege of chatting with Prof. Milanfar during Saturday night's reception at the Harvard Faculty club and got to share my personal views on what Glass means for Robotics researchers like myself.

Marc Levoy at ICCP 2013

During his presentation, Matthias Grundmann from Georgia Tech talked about his work on radiometric self-calibration of videos.  The implications of his work for visual object recognition from YouTube videos are fairly evident.  In other words, why have your machine learning algorithm deal with a source of appearance variation due to the imaging process if it can be removed?
Matthias Grundmann at ICCP 2013

Post-processing Approach for Radiometric Self-Calibration of Video. Matthias Grundmann (Georgia Tech), Chris McClanahan (Georgia Tech), Sing Bing Kang (Microsoft Research), Irfan Essa (Georgia Tech). ICCP 2013

Hany Farid from Dartmouth College presented an excellent keynote on Image Forensics.  Image manipulators beware!  His work is not going to make image forgery impossible, but it will take it out of the hands of amateurs.
Hany Farid at ICCP 2013

The best paper award was given to the following paper:
"3Deflicker from Motion" by Yohay Swirski (Technion), Yoav Schechner (Technion)

Good job Yohay and Yoav!

Finally, we (the MIT object detection hackers) will be setting up our own wearable computing platform, the HOGgles box, for the Demo session during lunch.  Carl Vondrick, Aditya Khosla, and I will also be there during the coffee breaks after lunch with the HOGgles demo.

Today should be as much fun as yesterday, and I will try to upload some videos of HOGgles in action later tonight.

Friday, April 19, 2013

Can you pass the HOGgles test? Inverting and Visualizing Features for Object Detection

Despite more than a decade of incessant research by some of the world's top computer vision researchers, we still ask ourselves "Why is object detection such a difficult problem?"

Surely, better features, better learning algorithms, and better models of visual object categories will result in improved object detection performance.  But instead of waiting an indefinite time until the research world produces another Navneet Dalal (of HOG fame) or Pedro Felzenszwalb (of DPM fame), we (the vision researchers in Antonio Torralba's lab at MIT) felt the time was ripe to investigate object detection failures from an entirely new perspective.

When we (the researchers) look at images, the problem of object detection appears trivial; however, object detection algorithms don't typically analyze raw pixels, they analyze images in feature spaces!  The Histogram of Oriented Gradients feature (commonly known as HOG) is the de-facto standard in object detection these days.  While looking at gradient distributions might make sense for machines, we felt that these features were incomprehensible to the (human) researchers who have to make sense of object detection failures.  Here is a motivating quote from Marcel Proust (a French novelist), which most accurately describes what we did:

“The real voyage of discovery consists not in seeking new landscapes but in having new eyes.” -- Marcel Proust

In short, we built new eyes.  These new "eyes" are a method for converting machine readable features into human readable RGB images.  We take a statistical machine learning approach to visualization -- we learn how to invert HOG using ideas from sparse coding and large-scale dictionary learning.  Let me briefly introduce the concept of HOGgles (i.e., HOG glasses).

Taken from Carl Vondrick's project abstract:
We present several methods to visualize the HOG feature space, a common descriptor for object detection. The tools in this paper allow humans to put on "HOG glasses" and see the visual world as a computer might see it.
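To give a flavor of what "inverting a feature with sparse coding" can look like, here is a minimal sketch.  It assumes a paired-dictionary setup in which an image patch x and its feature vector y share one sparse code a, so that x is approximately U a and y is approximately V a.  That formulation, the names sparse_code and invert_feature, and the random matrices standing in for learned dictionaries are all assumptions made for illustration; the project code linked below is the real reference.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(V, y, lam=0.01, n_iter=500):
    """ISTA for min_a 0.5 * ||V a - y||^2 + lam * ||a||_1."""
    step = 1.0 / np.linalg.norm(V, 2) ** 2  # 1 / (largest singular value squared)
    a = np.zeros(V.shape[1])
    for _ in range(n_iter):
        grad = V.T @ (V @ a - y)
        a = soft_threshold(a - step * grad, step * lam)
    return a

def invert_feature(U, V, y, lam=0.01):
    """Estimate an image patch from feature y using paired dictionaries (U, V)."""
    return U @ sparse_code(V, y, lam)

# Toy data: random stand-ins for an image dictionary U and a feature dictionary V.
rng = np.random.default_rng(0)
U = rng.standard_normal((100, 30))  # 100-dim "pixel" space, 30 atoms (hypothetical sizes)
V = rng.standard_normal((60, 30))   # 60-dim "feature" space, same 30 atoms
a_true = np.zeros(30)
a_true[[3, 11, 27]] = [1.0, -2.0, 0.5]
x_true, y = U @ a_true, V @ a_true

a_hat = sparse_code(V, y)
x_hat = invert_feature(U, V, y)
```

The key design choice in this sketch is that the code is estimated in feature space but decoded in pixel space, which is what turns a machine-readable vector back into something a human can look at.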

Here is an example of a short video (movie trailer for Terminator) which shows the manually engineered HOG visualization (commonly known as the HOG glyph), the original image, and our learned iHOG visualization.

We are presenting a real-time demo of this new and exciting line of work at the 2013 International Conference on Computational Photography (ICCP2013) which is being held at Harvard University this weekend (4/19/2013 - 4/21/2013).  If you want to try our sexy wearable platform and become a real-life object detector for a few minutes, then come check us out at this Sunday morning's demo session at ICCP2013 at Harvard University.

Also, if you thought TorralbaArt was cool, you must check out VondrickArt (a result of trying to predict color using the iHOG visualization framework)

Project-related Links:

Project website:
Project code (MATLAB-based) on Github:
arXiv paper:

Authors' webpages:

Carl Vondrick (MIT PhD student):
Aditya Khosla (MIT PhD student):
Tomasz Malisiewicz (MIT Postdoctoral Fellow):
Antonio Torralba (MIT Professor):

We hope that with these new eyes, we (the vision community) will better understand the failures and successes of machine vision systems.  I, for one, welcome our new HOGgles wearing overlords.

UPDATE: We will present the paper at ICCV 2013.  An MIT News article covering the research can be found here: