Tombone's Computer Vision Blog: December 2013

Friday, December 27, 2013

Understanding the Visual World, one 3D reconstruction at a time

The first generation of datasets in the computer vision community were just plain old images -- simple arrays of pixels. Seems like nothing fancy, but we must recall that there was a time when a single image could barely fit inside a computer's memory. During this early time, researchers showcased their image processing algorithms on the infamous Lenna image. But later we saw datasets like the Corel dataset, Caltech 101, LabelMe, SUN, James Hays' 6 million Flickr images, PASCAL VOC, and ImageNet. These were impressive collections of images and for the first time computer vision researchers did not have to collect their own images. As Jitendra Malik once said, large annotated datasets marked the end of the "Wild Wild West" in Computer Vision -- for the first time, large datasets allowed researchers to compare object recognition algorithms on the same sets of images! What is different about these datasets is that some come with annotations at the image level, some come with annotated polygons, and some come with nothing more than objects annotated at the bounding box level. Images are captured by a camera and annotations are produced by a human annotation effort. But these traditional vision datasets lack depth, 3D information, or anything of that sort. LabelMe3D was an attempt at reconstructing depth from object annotations, but it would only work in a pop-up world kind of way.

The next generation of datasets is all about going into 3D. But not just annotated depth images like the NYU2 Depth Dataset depicted in the following image:

What a 3D Environment dataset (or 3D place dataset) is all about is making 3D reconstructions the basic primitive of research. This means that an actual 3D reconstruction algorithm has to be run first to create the dataset. This is a fairly new idea in the Computer Vision community. The paper which introduces such a dataset, SUN3D, was presented at this year's ICCV 2013 conference. I briefly outlined the paper in my ICCV 2013 summary blog post, but I felt that this topic is worthy of its own blog post. For those interested, the paper link is below:

J. Xiao, A. Owens and A. Torralba. SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels. Proceedings of the 14th IEEE International Conference on Computer Vision (ICCV 2013). paper link

Running a 3D reconstruction algorithm is no easy feat, so Xiao et al. found that some basic polygon-level annotations were sufficient for snapping Structure from Motion algorithms into place. For those of you that don't know what a Structure from Motion (SfM) algorithm is, it is a process which reconstructs the 3D locations of points inside images (the structure) as well as the camera parameters (the motion) for a sequence of images. Xiao et al.'s SfM algorithm uses the depth data from a Kinect sensor in addition to the manually provided object annotations. Check out their video below:
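To make the "structure" and "motion" split a bit more concrete, here is a minimal sketch of the quantity that SfM systems typically try to make small: the reprojection error between where the current 3D points land in each camera and where they were actually observed. This is only an illustration under a simple pinhole camera assumption, not the SUN3D pipeline; the names `project` and `reprojection_error`, the focal length, and the toy scene are all made up for the example.

```python
import numpy as np

def project(points_3d, R, t, focal):
    """Project Nx3 world points into a pinhole camera (rotation R, translation t)."""
    cam = points_3d @ R.T + t                  # world -> camera coordinates
    return focal * cam[:, :2] / cam[:, 2:3]    # perspective divide

def reprojection_error(points_3d, cameras, observations):
    """Sum of squared 2D residuals over all cameras (hypothetical helper)."""
    total = 0.0
    for (R, t, focal), obs in zip(cameras, observations):
        total += np.sum((project(points_3d, R, t, focal) - obs) ** 2)
    return total

# Toy scene: three points (the structure) seen by two cameras (the motion).
points = np.array([[0.0, 0.0, 4.0], [1.0, 0.0, 5.0], [0.0, 1.0, 6.0]])
cameras = [
    (np.eye(3), np.zeros(3), 500.0),
    (np.eye(3), np.array([-0.5, 0.0, 0.0]), 500.0),
]
observations = [project(points, *cam) for cam in cameras]

perfect = reprojection_error(points, cameras, observations)
perturbed = reprojection_error(points + 0.1, cameras, observations)
print(perfect, perturbed)
```

A real system searches over both the points and the camera parameters to drive this error down; extra constraints, such as the object annotations mentioned above, would enter as additional terms in the same kind of objective.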

Depth dataset vs SUN 3D dataset
The NYU2 Depth dataset is useful for studying object detection algorithms which operate on 2.5D images, while MIT's SUN 3D dataset is useful for contextual reasoning and object-object relationships. This is important because Kinect images do not give full 3D; they merely return a 2.5D "depth image" from which certain physical relationships cannot be easily inferred.
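A small sketch shows why a depth image is only "2.5D": every pixel stores a single distance, so back-projecting it gives at most one 3D point per pixel, and anything hidden behind the nearest surface never shows up. The intrinsics (`focal`, `cx`, `cy`) and the tiny depth map below are invented for illustration and are not real Kinect calibration values.

```python
import numpy as np

def depth_to_points(depth, focal, cx, cy):
    """Back-project an HxW depth image into an Nx3 point cloud (pinhole model)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]            # pixel row (v) and column (u) indices
    x = (u - cx) * depth / focal
    y = (v - cy) * depth / focal
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]            # a depth of 0 marks a missing reading

# Toy 2x3 depth map; one pixel has no reading.
depth = np.array([[2.0, 2.0, 0.0],
                  [2.0, 1.0, 2.0]])
points = depth_to_points(depth, focal=1.0, cx=1.0, cy=0.5)
print(points.shape)
```

Five valid pixels give exactly five points: one visible surface sample per viewing ray. A full 3D reconstruction has to fuse many such views to recover what any single view misses.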

Startups doing 3D
It is also worthwhile pointing out that Matterport, a new startup, is creating their own sensors and algorithms for helping people create their own 3D reconstructions. Check out their video below:

What this means for the rest of us
We should expect the next generation of smartphones to have their own 3D sensors. In addition, we should expect the new generation of wearable devices such as Google Glass to give us more than 2D reasoning; they should be able to use this 3D data to make better visual inferences. I'm glad to see 3D getting more and more popular as this allows researchers to work on new problems, new data structures, and push their creativity to the next level!

Wednesday, December 25, 2013

My kickstarter explained using only the 1000 most used words

A wordle using all words from the VMX Project Kickstarter original text

Describing new technology without sounding intimidating is a daunting task, especially when you have a kickstarter about emerging technology which intersects the fields of artificial intelligence and computer vision! Above you can see a Wordle (a visual collection of most frequently used words, size proportional to frequency) I made from the original text of my Kickstarter named VMX Project: Computer Vision for Everyone. Below, I summarize the Kickstarter using only the 1000 most used words. To quickly explain: the Up-Goer Five Text Editor (inspired by this XKCD) lets you type some text and makes sure you don't use uncommon words. Here are just a few of the words that are banned from the summary: vision, code, product, item, detect, robot, automate, program, machine, automatic, object, and teach. I took a half hour, decided to use the word "VMX" because it is the product name, and this is the summary I came up with.

VMX: "Computer seeing" for everyone

VMX is all about "computer seeing." To make a computer recognize things like cars, people, faces, hands, dogs, couches, or bottles is not easy. If these things could be recognized in real-time either from a picture or live, we could make new ways of playing games with computers, play different music/shows when different friends' faces are recognized, or control your computer with your hands.

This type of "computer seeing" problem is very hard to try to answer yourself. I have been studying "computer seeing" for 10 years, and made VMX, which is a new way of getting stuff recognized in pictures by computers. You use VMX when you want something recognized in a picture. It is very fast -- you can add new stuff to be recognized in a few minutes. With VMX, you won't have to know how "computer seeing" is used by a computer to recognize things in pictures, and won't have to read hard books or papers. VMX is made for everyone to enjoy "computer seeing" -- it is easy, saves you lots of time, and can be quite fun. You can make VMX recognize and act on the things in your home, your friends' faces, or even body parts like hands. Because VMX is so easy to use, you spend more time using "computer seeing" in fun ways and we let you easily share your cool ideas with other people, if and when you want to.

If you give us $100, we'll let you use VMX early, and you'll get more use time (2x to 3x) than people paying later in the year. For $25, you'll still get more use time (2x to 3x). The more you help, the better deal on use time you get, and you get to play with VMX earlier!

Thanks for helping us make VMX come to life! Merry Christmas!

[This post was inspired by the Christmas entry from Scott Aaronson's blog, Shtetl-Optimized, called "Merry Christmas! My quantum computing research explained, using only the 1000 most common English words", and it's a great exercise for unlearning the curse of knowledge!]
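For the curious, the two text tricks in this post (counting word frequencies for a Wordle, and flagging words outside an allowed list the way the Up-Goer Five editor does) can be sketched in a few lines. This is only a rough illustration: the function names are made up, and `ALLOWED` is a tiny stand-in set, not the real list of the 1000 most used words.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lowercase words; a Wordle would scale each word by its count."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def uncommon_words(text, allowed):
    """Return the words in text that are not in the allowed set, sorted."""
    return sorted(w for w in word_counts(text) if w not in allowed)

# Tiny stand-in for the real 1000-word list, plus the product name.
ALLOWED = {"computer", "seeing", "for", "everyone", "is", "all", "about", "vmx"}

flagged = uncommon_words("VMX is all about computer vision for everyone", ALLOWED)
counts = word_counts("see see the computer see")
print(flagged, counts.most_common(1))
```

With this stand-in list, only "vision" is flagged in the sample sentence, which matches the spirit of the banned words listed above.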

Monday, December 23, 2013

VMX: Teach your computer to see without leaving the browser, my Kickstarter project

I’ve spent the last 12 years of my life learning how machines think, and now it is time to give a little something back. I’m not just talking about using computers, nor writing ordinary computer programs. I’m talking about Robotics, Artificial Intelligence, Machine Learning, and Computer Vision. Throughout these 12 years, I’ve witnessed how engineers and scientists pursue these problems, at three great universities: RPI, CMU, and MIT. I’ve been to 11 research conferences, given many talks, written and co-written many papers, helped teach a few computer vision courses, helped run a few innovation workshops centered around computer vision, and released some open-source computer vision code.

But now, in 2014, most people still struggle with understanding what computer vision is all about and how to get computer vision tools up and running. I’ve decided that a traditional career in Academia would allow me to motivate no more than a few classrooms of students per year. A rough estimate of 100 students per year across a 30-year career is a mere 3,000 students. What about everybody else? One could argue that some of these students would become educators themselves and the wonderful art of computer vision would reach beyond 3,000. But I can’t wait. I don’t want to wait. Computer vision is too awesome. I’m too excited. It's time for everybody to feel this excitement.

So I decided to do something crazy. Something I had wanted to do for a long time, but only recently realized would not be possible inside the confines of a University. I recruited the craziest and most bad-ass developer I’ve ever encountered and decided to do the following: convert advanced computer vision technology into a product form that would be so easy to use, a kid without any programming knowledge could train his own object detectors.

I’ve been working non-stop with my colleague and cofounder at our new company, vision.ai, to bring you the following Kickstarter campaign:

VMX Project: Computer Vision for Everyone

What if your computer was just a little bit smarter? What if it could understand what is going on in its surroundings merely by looking at the world through a camera? Such technology could be used to make games more engaging, make our interactions with computers more seamless, and allow computers to automate many of our daily chores and responsibilities. We believe that new technology shouldn’t be about advanced knobs or long manuals, and shouldn’t require domain expertise.

The VMX project was designed to bring cutting-edge computer vision technology to a very broad audience: hobbyists, researchers, artists, students, roboticists, engineers, and entrepreneurs. Not only will we educate you about potential uses of computer vision with our very own open-source vision apps, but the VMX project will give you all the tools you need to bring your own creative computer vision projects to life.

VMX gives individuals all they need to effortlessly build their very own computer vision applications. Our technology is built on top of 10+ years of research experience acquired from CMU, MIT, and Google. By leaving the hard stuff to us, you will be able to focus on creative uses of computer vision without the headaches of mastering machine learning algorithms or managing expensive computations. You won’t need to be a C++ guru or know anything about statistical machine learning algorithms to start using laboratory-grade computer vision tools for your own creative uses.

In order to make the barrier-of-entry to computer vision as low as possible, we built VMX directly in the browser and made sure that it requires no extra hardware. All you need is a laptop with a webcam and an internet connection. Because browsers such as Chrome and Firefox can read video directly from a webcam, you most likely have all of the required software and hardware. The only thing missing is VMX.

We're truly excited about what is going to happen next, but we need your help! Please spread the word, and if you're even mildly excited about computer vision, consider supporting this project.

Thanks Everyone!
Tomasz, @quantombone, author of tombone's computer vision blog

P.S. I'm not telling you what VMX stands for...

Friday, December 06, 2013

Brand Spankin' New Vision Papers from ICCV 2013

The International Conference on Computer Vision, ICCV, gathers the world's best researchers in Computer Vision and Machine Learning to showcase their newest and hottest ideas. (My work on the Exemplar-SVM debuted two years ago at ICCV 2011 in Barcelona.) This year, at ICCV 2013 in Sydney, Australia, the vision community witnessed lots of grand new ideas and excellent presentations, and gained new insights which are likely to influence the direction of vision research in the upcoming decade.

3D data is everywhere. Detectors are not only getting faster, but getting stylish. Edges are making a comeback. HOGgles let you see the world through the eyes of an algorithm. Computers can automatically make your face pictures more memorable. And why ever stop learning, when you can learn all day long?

Here is a breakdown of some of the must-read ICCV 2013 papers which I'd like to share with you:

From Large Scale Image Categorization to Entry-Level Categories, Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L. Berg, ICCV 2013.

This paper is the Marr Prize winning paper from this year's conference. It is all about entry-level categories - the labels people will use to name an object - which were originally defined and studied by psychologists in the 1980s. In the ICCV paper, the authors study entry-level categories at a large scale and learn the first models for predicting entry-level categories for images. The authors learn mappings between concepts predicted by existing visual recognition systems and entry-level concepts that could be useful for improving human-focused applications such as natural language image description or retrieval. NOTE: If you haven't read Eleanor Rosch's seminal 1978 paper, The Principles of Categorization, do yourself a favor: grab a tall coffee, read it and prepare to be rocked.

Structured Forests for Fast Edge Detection, P. Dollar and C. L. Zitnick, ICCV 2013.

This paper from Microsoft Research is all about pushing the boundaries for edge detection. Randomized Decision Trees and Forests have been used in lots of excellent Microsoft research papers, with Jamie Shotton's Kinect work being one of the best examples, and they are now being used for super high-speed edge detection. However, this paper is not just about edges. Quoting the authors, "We describe a general purpose method for learning structured random decision forest that robustly uses structured labels to select splits in the trees." Anybody serious about learning for low-level vision should take a look.

There is also some code available, but take a very detailed look at the license before you use it in your project. It is not an MIT license.

HOGgles: Visualizing Object Detection Features, C. Vondrick, A. Khosla, T. Malisiewicz, A. Torralba. ICCV 2013.

"The real voyage of discovery consists not in seeking new landscapes but in having new eyes." — Marcel Proust

This is our MIT paper, which I already blogged about (Can you pass the HOGgles test?), so instead of rehashing what was already mentioned, I'll just leave you with the quote above. There are lots of great visualizations that Carl Vondrick put together on the HOGgles project webpage, so take a look.

Style-aware Mid-level Representation for Discovering Visual Connections in Space and Time, Yong Jae Lee, Alexei A. Efros, and Martial Hebert, ICCV 2013.

“Learn how to see. Realize that everything connects to everything else.” – Leonardo da Vinci

This paper is all about discovering how visual entities change as a function of time and space. One great example is how the appearance of cars has changed over the past several decades. Another example is how typical Google Street View images change as a function of going North-to-South in the United States. Surely the North looks different than the South -- we now have an algorithm that can automatically discover these precise differences.

By the way, congratulations on the move to Berkeley, Monsieur Efros. I hope your insatiable thirst for cultured life will not only be satisfied in the city which fostered your intellectual growth, but you will continue to inspire, educate, and motivate the next generation of visionaries.

NEIL: Extracting Visual Knowledge from Web Data. Xinlei Chen, Abhinav Shrivastava and Abhinav Gupta. In ICCV 2013. www.neil-kb.com

Fucking awesome! I don't normally use profanity in my blog, but I couldn't come up with a better phrase to describe the ideas presented in this paper. A computer program which runs 24/7 to collect visual data from the internet and continually learn what the world is all about. This is machine learning, this is AI, this is the future. None of this train on my favourite dataset, test on my favourite dataset bullshit. If there's anybody that's going to do it the right way, it's the CMU gang. This paper gets my unofficial "Vision Award." Congratulations, Xinlei!

This sort of never-ending learning has been applied to text by Tom Mitchell's group (also from CMU), but this is the first serious attempt at never-ending visual learning. The underlying algorithm is a semi-supervised learning algorithm which uses Google Image search to bootstrap the initial detectors, but eventually learns object-object relationships, object-attribute relationships, and scene-attribute relationships.

Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition. J. F. Henriques, J. Carreira, R. Caseiro, J. Batista. ICCV 2013.

Want faster detectors? Tired of hard-negative mining? Love all things Fourier? Then this paper is for you. Aren't you now glad you fell in love with linear algebra at a young age? This paper very clearly shows that there is a better way to perform hard-negative mining when the negatives are mined from translations of an underlying image pattern, as is typically done in object detection. The basic idea is simple, and that's why this paper wins the "thumbs-up from tombone" award. The crux of the derivation in the paper is the observation that the Gram matrix of a set of images and their translated versions, as modeled by cyclic shifts, exhibits a block-circulant structure. Instead of incrementally mining negatives, in this paper they show that it is possible to learn directly from a training set comprising all image subwindows of a predetermined aspect-ratio and show this is feasible for a rich set of popular models including Ridge Regression, Support Vector Regression (SVR) and Logistic Regression. Move over hard-negative mining, Joseph Fourier just rocked your world.
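To get a feel for why circulant structure helps, here is a simplified 1D sketch, not the paper's block-circulant algorithm. It assumes a single 1D signal `x`, builds a data matrix whose rows are all of its cyclic shifts, and checks that ridge regression on that matrix can be solved element-wise in the Fourier domain instead of by solving a dense linear system. The function names and the random data are invented for the example.

```python
import numpy as np

def ridge_direct(x, y, lam):
    """Ridge regression where the training samples are all cyclic shifts of x."""
    n = len(x)
    X = np.stack([np.roll(x, i) for i in range(n)])   # row i is x shifted by i
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def ridge_fourier(x, y, lam):
    """Same solution, using the fact that the DFT diagonalizes circulant matrices."""
    xf, yf = np.fft.fft(x), np.fft.fft(y)
    wf = xf * yf / (np.abs(xf) ** 2 + lam)            # element-wise, no matrix solve
    return np.real(np.fft.ifft(wf))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)      # base signal
y = rng.standard_normal(16)      # one regression target per shift
w_direct = ridge_direct(x, y, lam=0.1)
w_fourier = ridge_fourier(x, y, lam=0.1)
print(np.allclose(w_direct, w_fourier))
```

The direct version builds and solves an n-by-n system, while the Fourier version needs only a few FFTs. The paper deals with the much richer case of many images and multi-channel features, where the Gram matrix is block-circulant rather than plainly circulant.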

P.S. Joao Carreira also created the CPMC image segmentation algorithm at CVPR 2010. A recent blog post from Piotr Dollár (December 10th, 2013), "A Seismic Shift in Object Detection" discusses how segmentation is coming back into vision in a big way.

3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding, Scott Satkin and Martial Hebert. ICCV 2013.

A new way of matching images that come equipped with 3D data. Whether the data comes from Google Sketchup, or is the output of a Kinect-like scanner, more and more visual data comes with its own 3D interpretation. Unfortunately, most state-of-the-art image matching methods rely on comparing purely visual cues. This paper is based on an idea called "fine-grained geometry refinement" and allows the transfer of information across extreme viewpoint changes. While still computationally expensive, it allows non-parametric (i.e., data-driven) approaches to get away with using significantly smaller amounts of data.

Modifying the Memorability of Face Photographs. Aditya Khosla, Wilma A. Bainbridge, Antonio Torralba and Aude Oliva, ICCV 2013.

Ever wanted to look more memorable in your photos? Maybe your ad-campaign could benefit from better face pictures which are more likely to stick in people's minds. Well, now there's an algorithm for that. Another great MIT paper, in which the authors show that the memorability of photographs can not only be measured, but automatically enhanced!

SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels. J. Xiao, A. Owens and A. Torralba. ICCV 2013. sun3d.cs.princeton.edu

Xiao et al. continue their hard-core data collection efforts. Now in 3D. In addition to collecting a vast dataset of 3D reconstructed scenes, they show that there are some kinds of errors that simply cannot be overcome with high-quality solvers. Some problems are too big and too ambitious (e.g., walking around an entire house with a Kinect) for even the best industrial-grade solvers (Google's Ceres solver) to tackle. In this paper, they show that a small amount of human annotation is all it takes to snap those reconstructions into place. And not any sort of crazy, click-here, click-there interfaces. Simple LabelMe-like annotation interfaces, which require annotating object polygons, can be used to create additional object-object constraints which help the solvers do their magic. For anybody interested in long-range scene reconstruction, take a look at their paper.

If there's one person I've ever seen that collects data while the rest of the world sleeps, it is definitely Prof. Xiao. Congratulations on the new faculty position! Princeton has been starving for a person like you. If anybody is looking for PhD/Masters/postdoc positions, and wants to work alongside one of the most ambitious and driven upcoming researchers in vision (Prof. Xiao), take a look at his disclaimer/call for students/postdocs at Princeton, then apply to the program directly. Did I mention that you probably have to be a hacker/scientist badass to land a position in his lab?

Other noteworthy papers:

Mining Multiple Queries for Image Retrieval: On-the-fly learning of an Object-specific Mid-level Representation. B. Fernando, T. Tuytelaars, ICCV 2013.

Training Deformable Part Models with Decorrelated Features. R. Girshick, J. Malik, ICCV 2013.

Sorry if I missed your paper; there were just too many good ones to list. For those of you still in Sydney, be sure to either take a picture of a kangaroo, or eat one.

Sunday, December 01, 2013

What my mother taught me about computer vision

“Wake up, Tomek. Pack your bags. We’re moving to America.”

These were the words my mother whispered into my ear as she roused me from a deep sleep. There was no alarm clock and no preparation (at least not on my behalf). I was eight years old, and it was a typical January morning in Poland. It was 1992, and besides a brief venture into Czechoslovakia a few years earlier, I had never left Poland before.

I can still remember those words like they were uttered yesterday. I remember both the comfort of a child being woken up by the reassuring words of one’s mother as well as the excitement of what those words meant. It was a matter of hours until I would experience my first international flight, my first multi-lane highway, my first supermarket, and get my first dose of American television.

What I learned from my mother is that sometimes, you just have to pack your bags and go. That is the lesson my mother taught me, and it wasn’t delivered in the form of a university lecture. It was an action. An action that would be the single most influential event in my life. Moving to the Land of Opportunity from Poland wasn’t something you could not be excited about.

There is a certain kind of excitement that occurs when you make such a bold move in your life. It requires a certain kind of courage, a certain kind of entrepreneurial spirit. A certain vision for the future and a certain willingness to take a calculated risk. A vision that might be filled with uncertainty, but when the uncertainty is drowned by hope, any residual fear just melts away.

My mother never taught me anything about quantum mechanics. She never provided me with extra tutors that would one day help me get into a good college, no guidance on how to get into a great PhD program, no etiquette lessons on how to become a respected scientist, etc. But she gave me the courage and confidence to know that if you want something in life and you have the willingness to pursue it, you can get it. The courage that my mother's actions instilled in me has been more influential in my personal development than any single formal source of knowledge so far. Thanks, Mom.

Computer vision is all about the future. It is all about risks. It requires a certain entrepreneurial spirit that cannot be attained within the comfy confines of the ivory tower. I see a world where the way we interact with machines is drastically different than today. I see a future where we are no longer slaves to our smartphones, where automation will allow us to embrace our human side. A future where technology will allow us to be free from the worries and stresses which saturate contemporary life. Computer vision is the interface of the future. It will allow both machines to make sense of the world around them, and us to interact with these machines in a much more intuitive way.

But this sort of change cannot happen without a change in attitude. As of December 2013, computer vision is simply too academic. Too much mathematics simply for the sake of mathematics. Too much emphasis on advancing the state-of-the-art by writing esoteric papers and competing on silly benchmarks. As a community we have made tremendous advancements, but we have to take more risks. We have to let go of our egos, and stop worrying about our individual resumes.

I no longer believe that the sort of change I want to see in the world is going to happen by itself. I want computer vision to revolutionize the way we interact with computers. I believe in Computer Vision the same way I believed (and still do) in America. Computer vision is the technology of the future; it is the technology of opportunity. But this cannot happen as long as I continue to portray myself as solely an academic figure. I know that the way I’m approaching life now is much riskier than getting a traditional job/career in the sciences. It’s strange to admit that my last day at MIT has been much more exciting for me than my first day at MIT. I am excited. My fledgling team is excited. After our product launch, we’re hoping you will participate in our excitement. I think the fun times are only beginning. The only limits we have are the ones we impose upon ourselves.

“Wake up, computer vision. Pack your bags. You’re moving into everyone’s home.”