Wednesday, January 13, 2016

The Future of Real-Time SLAM and "Deep Learning vs SLAM"

Last month's International Conference on Computer Vision (ICCV) was full of Deep Learning techniques, but before we declare an all-out ConvNet victory, let's see how the other "non-learning" geometric side of computer vision is doing. Simultaneous Localization and Mapping, or SLAM, is arguably one of the most important algorithms in Robotics, with pioneering work done by both the computer vision and robotics research communities. Today I'll be summarizing my key takeaways from ICCV's Future of Real-Time SLAM Workshop, which was held on the last day of the conference (December 18th, 2015).

Today's post contains a brief introduction to SLAM, a detailed description of what happened at the workshop (with summaries of all 7 talks), and some take-home messages from the Deep Learning-focused panel discussion at the end of the session.

SLAM visualizations. Can you identify any of these SLAM algorithms?

Part I: Why SLAM Matters

Visual SLAM algorithms are able to simultaneously build 3D maps of the world while tracking the location and orientation of the camera (hand-held or head-mounted for AR or mounted on a robot). SLAM algorithms are complementary to ConvNets and Deep Learning: SLAM focuses on geometric problems and Deep Learning is the master of perception (recognition) problems. If you want a robot to go towards your refrigerator without hitting a wall, use SLAM. If you want the robot to identify the items inside your fridge, use ConvNets.

Basics of SfM/SLAM: From point observations and intrinsic camera parameters, the 3D structure of a scene is computed from the estimated motion of the camera. For details, see the openMVG website.

SLAM is a real-time version of Structure from Motion (SfM). Visual SLAM or vision-based SLAM is a camera-only variant of SLAM which forgoes expensive laser sensors and inertial measurement units (IMUs). Monocular SLAM uses a single camera while non-monocular SLAM typically uses a pre-calibrated fixed-baseline stereo camera rig. SLAM is a prime example of what is called a "Geometric Method" in Computer Vision. In fact, CMU's Robotics Institute splits the graduate level computer vision curriculum into a Learning-based Methods in Vision course and a separate Geometry-Based Methods in Vision course.

Structure from Motion vs Visual SLAM
Structure from Motion (SfM) and SLAM are solving a very similar problem, but while SfM is traditionally performed in an offline fashion, SLAM has been slowly moving towards the low-power / real-time / single RGB camera mode of operation. Many of today’s top experts in Structure from Motion work for some of the world’s biggest tech companies, helping make maps better. Successful mapping products like Google Maps could not have been built without intimate knowledge of multiple-view geometry, SfM, and SLAM. A typical SfM problem is the following: given a large collection of photos of a single outdoor structure (like the Colosseum), construct a 3D model of the structure and determine the cameras' poses. The image collection is processed in an offline setting, and large reconstructions can take anywhere between hours and days.

SfM Software: Bundler is one of the most successful open-source SfM libraries

Here are some popular SfM-related software libraries:

Visual SLAM vs Autonomous Driving
While self-driving cars are one of the most important applications of SLAM, according to Andrew Davison, one of the workshop organizers, SLAM for Autonomous Vehicles deserves its own research track. (And as we'll see, none of the workshop presenters talked about self-driving cars). For many years to come it will make sense to continue studying SLAM from a research perspective, independent of any single Holy-Grail application. While there are just too many system-level details and tricks involved with autonomous vehicles, research-grade SLAM systems require very little more than a webcam, knowledge of algorithms, and elbow grease. As a research topic, Visual SLAM is much friendlier to thousands of early-stage PhD students who’ll first need years of in-lab experience with SLAM before even starting to think about expensive robotic platforms such as self-driving cars.

Google's Self-Driving Car's perception system. From IEEE Spectrum's "How Google's Self-Driving Car Works"

Part II: The Future of Real-time SLAM

Now it's time to officially summarize and comment on the presentations from The Future of Real-time SLAM workshop. Andrew Davison started the day with an excellent historical overview of SLAM called 15 years of vision-based SLAM, and his slides have good content for an introductory robotics course.

For those of you who don’t know Andy, he is the one and only Professor Andrew Davison of Imperial College London. Best known for his 2003 MonoSLAM system, he was one of the first to show how to build SLAM systems from a single “monocular” camera at a time when just about everybody thought you needed a stereo “binocular” camera rig. More recently, his work has influenced the trajectory of companies such as Dyson and the capabilities of their robotic systems (e.g., the brand new Dyson360).

I remember Professor Davison from the Visual SLAM tutorial he gave at the BMVC Conference back in 2007. Surprisingly, very little has changed in SLAM compared to the rest of the machine-learning heavy work being done at the main vision conferences. In the past 8 years, object recognition has undergone 2-3 mini revolutions, while today's SLAM systems don't look much different than they did 8 years ago. The best way to see the progress of SLAM is to take a look at the most successful and memorable systems. In Davison’s workshop introduction talk, he discussed some of these exemplary systems which were produced by the research community over the last 10-15 years:

  • MonoSLAM
  • PTAM
  • DTAM
  • KinectFusion

Davison vs Horn: The next chapter in Robot Vision
Davison also mentioned that he is working on a new Robot Vision book, which should be an exciting treat for researchers in computer vision, robotics, and artificial intelligence. The last Robot Vision book was written by B.K. Horn (1986), and it’s about time for an updated take on Robot Vision. 

A new robot vision book?

While I’ll gladly read a tome that focuses on the philosophy of robot vision, personally I would like the book to focus on practical algorithms for robot vision, like the excellent Multiple View Geometry book by Hartley and Zisserman or Probabilistic Robotics by Thrun, Burgard, and Fox. A "cookbook" of visual SLAM problems would be a welcome addition to any serious vision researcher's collection.

Related: Davison's 15-years of vision-based SLAM slides

Talk 1: Christian Kerl on Continuous Trajectories in SLAM
The first talk, by Christian Kerl, presented a dense tracking method to estimate a continuous-time trajectory. The key observation is that most SLAM systems estimate camera poses at a discrete number of time steps (either the key frames, which are spaced several seconds apart, or the individual frames, which are spaced approximately 1/25s apart).
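To make the discrete-vs-continuous idea concrete, here is a minimal sketch of querying a pose at an arbitrary time between two discrete poses. This is not the method from Kerl's talk: I am assuming plain linear interpolation of position and spherical linear interpolation (slerp) of a unit quaternion, and the names slerp and pose_at are my own illustrative choices.

```python
import math

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    # Take the shorter arc: q and -q encode the same rotation.
    if dot < 0.0:
        q1 = tuple(-c for c in q1)
        dot = -dot
    dot = min(1.0, dot)
    theta = math.acos(dot)
    if theta < 1e-8:
        return q0  # the two rotations are (nearly) identical
    s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return tuple(s0 * a + s1 * b for a, b in zip(q0, q1))

def pose_at(time, t0, pose0, t1, pose1):
    """Query a pose at any time in [t0, t1]; a pose is (position, quaternion)."""
    u = (time - t0) / (t1 - t0)
    p0, q0 = pose0
    p1, q1 = pose1
    position = tuple((1.0 - u) * a + u * b for a, b in zip(p0, p1))
    return position, slerp(q0, q1, u)

# Two poses 1/25 s apart: identity, then a 90-degree rotation about z.
start = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0))
end = ((1.0, 0.0, 0.0), (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)))
# With a continuous trajectory, every image row could query its own timestamp.
mid_position, mid_rotation = pose_at(0.02, 0.0, start, 0.04, end)
```

The point of the sketch is only that a pose becomes a function of time rather than a per-frame constant, which is what lets each row of a rolling shutter image get its own pose.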

Continuous Trajectories vs Discrete Time Points. SLAM/SfM usually uses discrete time points, but why not go continuous?

Much of Kerl’s talk was focused on undoing the damage of rolling shutter cameras, and the system demo’ed by Kerl paid meticulous attention to modeling and removing these adverse rolling shutter effects.

Undoing the damage of rolling shutter in Visual SLAM.

Related: Kerl's Dense continuous-time tracking and mapping slides.
Related: Dense Continuous-Time Tracking and Mapping with Rolling Shutter RGB-D Cameras (C. Kerl, J. Stueckler, D. Cremers), In IEEE International Conference on Computer Vision (ICCV), 2015. [pdf]

Talk 2: Semi-Dense Direct SLAM by Jakob Engel
LSD-SLAM came out at ECCV 2014 and is one of my favorite SLAM systems today! Jakob Engel was there to present his system and show the crowd some of the coolest SLAM visualizations in town. LSD-SLAM is an acronym for Large-Scale Direct Monocular SLAM. LSD-SLAM is an important system for SLAM researchers because it does not use corners or any other local features. Direct tracking is performed by image-to-image alignment using a coarse-to-fine algorithm with a robust Huber loss. This is quite different from the feature-based systems out there. Depth estimation uses an inverse depth parametrization (like many other SLAM systems) and uses a large number of relatively small-baseline image pairs. Rather than relying on image features, the algorithm is effectively performing “texture tracking”. Global mapping is performed by creating and solving a pose graph "bundle adjustment" optimization problem, and all of this works in real-time. The method is semi-dense because it only estimates depth at pixels near image boundaries. LSD-SLAM output is denser than traditional features, but not fully dense like Kinect-style RGBD SLAM.
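For readers who have not met the Huber loss or inverse depth before, here is a small sketch of both ideas. It is not LSD-SLAM's code: I am assuming the pixels have already been warped into the reference frame, and the function names, the threshold delta, and the intensity values are made up for illustration.

```python
def huber(residual, delta=1.0):
    """Huber loss: quadratic near zero, linear for large residuals."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

def photometric_cost(ref_pixels, cur_pixels, delta=10.0):
    """Sum of Huber losses over intensity differences of already-warped pixels."""
    return sum(huber(c - r, delta) for r, c in zip(ref_pixels, cur_pixels))

def depth_from_inverse(inv_depth):
    """Inverse depth stores 1/z, so very distant points stay close to zero."""
    return 1.0 / inv_depth

ref = [100.0, 120.0, 90.0, 200.0]
cur = [102.0, 119.0, 91.0, 80.0]  # the last pixel is an outlier (e.g. occlusion)
robust = photometric_cost(ref, cur)
squared = sum(0.5 * (c - r) ** 2 for r, c in zip(ref, cur))
```

In this toy example the single outlier contributes 1150 to the Huber cost but 7200 to the plain squared cost, which is why a robust loss keeps one bad pixel from dominating the alignment.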

LSD-SLAM in Action: LSD-SLAM generates both a camera trajectory and a semi-dense 3D scene reconstruction. This approach works in real-time, does not use feature points as primitives, and performs direct image-to-image alignment.

Engel gave us an overview of the original LSD-SLAM system as well as a handful of new results, extending their initial system to more creative applications and to more interesting deployments. (See paper citations below)

Related: LSD-SLAM Open-Source Code on github, LSD-SLAM project webpage
Related: LSD-SLAM: Large-Scale Direct Monocular SLAM (J. Engel, T. Schöps, D. Cremers), In European Conference on Computer Vision (ECCV), 2014. [pdf] [youtube video]

An extension to LSD-SLAM, Omni LSD-SLAM was created by the observation that the pinhole model does not allow for a large field of view. This work was presented at IROS 2015 (Caruso is first author) and allows a large field of view (ideally more than 180 degrees). From Engel’s presentation it was pretty clear that you can perform ballerina-like motions (extreme rotations) while walking around your office and holding the camera. This is one of those worst-case scenarios for narrow field of view SLAM, yet works quite well in Omni LSD-SLAM.

Omnidirectional LSD-SLAM Model. See Engel's Semi-Dense Direct SLAM presentation slides.

Related: Large-Scale Direct SLAM for Omnidirectional Cameras (D. Caruso, J. Engel, D. Cremers), In International Conference on Intelligent Robots and Systems (IROS), 2015.  [pdf] [youtube video]

Stereo LSD-SLAM is an extension of LSD-SLAM to a binocular camera rig. This helps in getting the absolute scale, initialization is instantaneous, and there are no issues with strong rotation. While monocular SLAM is very exciting from an academic point of view, if your robot is a $30,000 car or a $10,000 drone prototype, you should have a good reason not to use a rig with two or more cameras. Stereo LSD-SLAM performs quite competitively on SLAM benchmarks.
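Why does a stereo rig give you absolute scale? For a rectified pair, the textbook relation is Z = f * B / d, so a known baseline in meters yields depth in meters. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def stereo_depth(focal_px, baseline_m, disparity_px):
    """Metric depth for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 700-pixel focal length, 0.5 m baseline, 35-pixel disparity.
depth_m = stereo_depth(focal_px=700.0, baseline_m=0.5, disparity_px=35.0)
```

A monocular system has no known B, which is why its reconstructions are only defined up to an unknown scale factor.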

Stereo LSD-SLAM. Excellent results on KITTI vehicle-SLAM dataset.

Stereo LSD-SLAM is quite practical, optimizes a pose graph in SE(3), and includes a correction for auto exposure. The goal of the auto-exposure correction is to make the error function invariant to affine lighting changes. The underlying parameters of the color-space affine transform are estimated during matching, but thrown away to estimate the image-to-image error. From Engel's talk, outliers (often caused by over-exposed image pixels) tend to be a problem, and much care needs to be taken to handle their effects.
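Here is a rough sketch of what "invariant to affine lighting changes" means. I am assuming the brightness model cur = a * ref + b and fitting a and b with ordinary least squares over matched pixels; this is my own simplification and not how the Stereo LSD-SLAM paper estimates the parameters. The function name and the intensities are invented.

```python
def fit_affine_brightness(ref, cur):
    """Least-squares fit of cur ~= a * ref + b over matched pixel intensities."""
    n = len(ref)
    mean_r = sum(ref) / n
    mean_c = sum(cur) / n
    var_r = sum((r - mean_r) ** 2 for r in ref)
    cov = sum((r - mean_r) * (c - mean_c) for r, c in zip(ref, cur))
    a = cov / var_r
    b = mean_c - a * mean_r
    return a, b

ref = [10.0, 50.0, 100.0, 200.0]
cur = [1.5 * r + 20.0 for r in ref]  # same scene after an exposure change
a, b = fit_affine_brightness(ref, cur)
# After compensation, the photometric residuals vanish for this toy data.
residuals = [c - (a * r + b) for r, c in zip(ref, cur)]
```

Note that a plain least-squares fit like this one is itself sensitive to over-exposed pixels, which is presumably part of why outliers came up in the talk.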

Related: Large-Scale Direct SLAM with Stereo Cameras (J. Engel, J. Stueckler, D. Cremers), In International Conference on Intelligent Robots and Systems (IROS), 2015.  [pdf] [youtube video]

Later in his presentation, Engel gave us a sneak peek at new research on integrating both stereo and inertial sensors. For details, you’ll have to keep hitting refresh on arXiv or talk to Usenko/Engel in person. On the applications side, Engel's presentation included updated videos of an Autonomous Quadrotor driven by LSD-SLAM. The flight starts with an up-down motion to get the scale estimate, and an octomap is used to estimate the free space so that the quadrotor can navigate on its own. Stay tuned for an official publication...
Quadrotor running Stereo LSD-SLAM. 

The story of LSD-SLAM is also the story of feature-based vs direct methods, and Engel gave both sides of the debate a fair treatment. Feature-based methods are engineered to work on top of Harris-like corners, while direct methods use the entire image for alignment. Feature-based methods are faster (as of 2015), but direct methods are good for parallelism. Outliers can be retroactively removed from feature-based systems, while direct methods are less flexible w.r.t. outliers. Rolling shutter is a bigger problem for direct methods and it makes sense to use a global shutter or a rolling shutter model (see Kerl’s work). Feature-based methods require making decisions using incomplete information, but direct methods can use much more information. Feature-based methods have no need for good initialization, while direct methods need some clever tricks for initialization. There are only about 4 years of research on direct methods and 20+ on sparse methods. Engel is optimistic that direct methods will one day rise to the top, and so am I.

Feature-based vs direct methods of building SLAM systems. Slide from Engel's talk.

At the end of Engel's presentation, Davison asked about semantic segmentation and Engel wondered whether semantic segmentation can be performed directly on semi-dense "near-image-boundary" data.  However, my personal opinion is that there are better ways to apply semantic segmentation to LSD-like SLAM systems. Semi-dense SLAM can focus on geometric information near boundaries, while object recognition can focus on reliable semantics away from the same boundaries, potentially creating a hybrid geometric/semantic interpretation of the image.

Related: Engel's Semi-Dense Direct SLAM presentation slides

Talk 3: Sattler on The challenges of Large-Scale Localization and Mapping
Torsten Sattler gave a talk on large-scale localization and mapping. The motivation for this work is to perform 6-dof localization inside an existing map, especially for mobile localization. One of the key points in the talk was that when you are using traditional feature-based methods, storing your descriptors soon becomes very costly. Techniques such as visual vocabularies (remember product quantization?) can significantly reduce memory overhead, and with clever optimization at some point storing descriptors no longer becomes the memory bottleneck.

Another important take-home message from Sattler’s talk is that the number of inliers is not actually a good confidence measure for camera pose estimation. When the feature points are all concentrated in a single part of the image, the camera localization can be kilometers off! A better measure of confidence is the “effective inlier count”, which looks at the area spanned by the inliers as a fraction of total image area. What you really want is feature matches from all over the image: if the information is spread out across the image, you get a much better pose estimate.

Sattler’s take on the future of real-time SLAM is the following: we should focus on compact map representations, we should get better at understanding camera pose estimate confidences (like down-weighting features from trees), and we should work on more challenging scenes (such as worlds with planar structures and nighttime localization against daytime maps).
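To illustrate why spread matters, here is a toy version of an area-aware inlier score. The talk's exact definition is not reproduced here; I am assuming a simple proxy where the raw inlier count is scaled by the fraction of grid cells that contain at least one inlier. The grid size, function names, and point sets are all hypothetical.

```python
def coverage_fraction(inliers, width, height, grid=4):
    """Fraction of grid cells that contain at least one inlier (x, y)."""
    occupied = set()
    for x, y in inliers:
        col = min(int(x * grid / width), grid - 1)
        row = min(int(y * grid / height), grid - 1)
        occupied.add((row, col))
    return len(occupied) / (grid * grid)

def effective_inliers(inliers, width, height, grid=4):
    """Raw inlier count scaled by how much of the image the inliers cover."""
    return len(inliers) * coverage_fraction(inliers, width, height, grid)

# Sixteen inliers bunched in one corner vs. sixteen spread over a 320x240 image.
clustered = [(10 + i, 10 + i) for i in range(16)]
spread = [(80 * c + 40, 60 * r + 30) for r in range(4) for c in range(4)]
clustered_score = effective_inliers(clustered, 320, 240)
spread_score = effective_inliers(spread, 320, 240)
```

Both sets have the same raw count, but under this proxy the clustered set scores 1.0 while the spread-out set keeps its full 16.0.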

Mobile Localisation: Sattler's key problem is localizing yourself inside a large city with a single smartphone picture

Related: Scalable 6-DOF Localization on Mobile Devices. Sven Middelberg, Torsten Sattler, Ole Untzelmann, Leif Kobbelt. In ECCV 2014. [pdf]
Related: Torsten Sattler's The challenges of large-scale localisation and mapping slides

Talk 4: Mur-Artal on Feature-based vs Direct-Methods
Raúl Mur-Artal, the creator of ORB-SLAM, dedicated his entire presentation to the Feature-based vs Direct-method debate in SLAM and he's definitely on the feature-based side. ORB-SLAM is available as an open-source SLAM package and it is hard to beat. During his evaluation of ORB-SLAM vs PTAM it seems that PTAM actually fails quite often (at least on the TUM RGB-D benchmark). LSD-SLAM errors are also much higher on the TUM RGB-D benchmark than expected.

Feature-Based SLAM vs Direct SLAM. See Mur-Artal's Should we still do sparse feature based SLAM? presentation slides

Related: Mur-Artal's Should we still do sparse-feature based SLAM? slides
Related: ORB-SLAM: A Versatile and Accurate Monocular SLAM System. R. Mur-Artal, J. M. M. Montiel and J. D. Tardos. IEEE Transactions on Robotics, 2015. [pdf]
Related: ORB-SLAM Open-source code on github, Project Website

Talk 5: Project Tango and Visual loop-closure for image-2-image constraints
Simply put, Google's Project Tango is the world's first attempt at commercializing SLAM. Simon Lynen from Google Zurich (formerly ETH Zurich) came to the workshop with a Tango live demo (on a tablet) and a presentation on what's new in the world of Tango. In case you don't already know, Google wants to put SLAM capabilities into the next generation of Android devices.

Google's Project Tango needs no introduction.

The Project Tango presentation discussed a new way of doing loop closure by finding certain patterns in the image-to-image matching matrix. This comes from the “Placeless Place Recognition” work. They also do online bundle adjustment with vision-based loop closure.
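The talk summary above does not spell out which patterns are searched for, so take this as my own guess at the flavor of the idea rather than the Placeless Place Recognition algorithm: if a run of query frames matches a run of map frames in the same order, the matching matrix shows a diagonal streak. The function name and the tiny 0/1 matrix are hypothetical.

```python
def longest_diagonal_run(matches):
    """Longest run of consecutive matches along any diagonal of a 0/1 matrix.

    Returns (length, end_row, end_col). A long run means a sequence of
    query frames matched a sequence of map frames in the same order.
    """
    best = (0, -1, -1)
    rows, cols = len(matches), len(matches[0])
    run = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if matches[i][j]:
                prev = run[i - 1][j - 1] if i > 0 and j > 0 else 0
                run[i][j] = prev + 1
                if run[i][j] > best[0]:
                    best = (run[i][j], i, j)
    return best

# Rows are query frames, columns are map frames; 1 marks a candidate match.
matches = [
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
length, end_row, end_col = longest_diagonal_run(matches)
```

Here the isolated match in the third row is ignored in favor of the three-frame streak, which is the kind of sequence evidence that makes a loop-closure hypothesis more trustworthy than any single image match.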

Loop Closure inside a Project Tango? Lynen et al's Placeless Place Recognition. The image-to-image matrix reveals a new way to look for loop-closure. See the algorithm in action in this youtube video.

The Project Tango folks are also working on combining multiple crowd-sourced maps at Google, where the goal is to combine multiple mini-maps created by different people using Tango-equipped devices.

Simon showed a video of mountain bike trail tracking which is actually quite difficult in practice. The idea is to go down a mountain bike trail using a Tango device and create a map, then the follow-up goal is to have a separate person go down the trail. This currently “semi-works” when there are a few hours between the map building and the tracking step, but won’t work across weeks/months/etc. 

During the Tango-related discussion, Richard Newcombe pointed out that the “features” used by Project Tango are quite primitive w.r.t. getting a deeper understanding of the environment, and it appears that Project Tango-like methods won't work on outdoor scenes where the world is plagued by non-rigidity, massive illumination changes, etc.  So are we to expect different systems being designed for outdoor systems or will Project Tango be an indoor mapping device?

Related: Placeless Place Recognition. Lynen, S. ; Bosse, M. ; Furgale, P. ; Siegwart, R. In 3DV 2014.

Talk 6: ElasticFusion is DenseSLAM without a pose-graph
ElasticFusion is a dense SLAM technique which requires an RGBD sensor like the Kinect. Taking only 2-3 minutes to obtain a high-quality 3D scan of a single room is pretty cool. A pose-graph is used behind the scenes of many (if not most) SLAM systems, and this technique takes a different (map-centric) approach. The approach focuses on building a map, but the trick is that the map is deformable, hence the name ElasticFusion. The “Fusion” part of the name is in homage to KinectFusion, which was one of the first high-quality Kinect-based reconstruction pipelines. Surfels are used as the underlying primitives.

Image from Kintinuous, an early version of Whelan's Elastic Fusion.

Recovering light sources: we were given a sneak peek at new unpublished work from Imperial College London / Dyson Robotics Lab. The idea is that by detecting the light source direction and detecting specularities, you can improve 3D reconstruction results. There were cool videos of recovering light source locations, which works for up to 4 separate lights.

Related: Map-centric SLAM with ElasticFusion presentation slides
Related: ElasticFusion: Dense SLAM Without A Pose Graph. Whelan, Thomas and Leutenegger, Stefan and Salas-Moreno, Renato F and Glocker, Ben and Davison, Andrew J. In RSS 2015.

Talk 7: Richard Newcombe’s DynamicFusion
Richard Newcombe (whose recently formed company was acquired by Oculus) was the last presenter. It's really cool to see the person behind DTAM, KinectFusion, and DynamicFusion now working in the VR space.

Newcombe's Dynamic Fusion algorithm. The technique won the prestigious CVPR 2015 best paper award, and to see it in action just take a look at the authors' DynamicFusion Youtube video.

Related: DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time, Richard A. Newcombe, Dieter Fox, Steven M. Seitz. In CVPR 2015. [pdf] [Best-Paper winner]
Related: SLAM++: Simultaneous Localisation and Mapping at the Level of Objects, Renato F. Salas-Moreno, Richard A. Newcombe, Hauke Strasdat, Paul H. J. Kelly and Andrew J. Davison (CVPR 2013)
Related: KinectFusion: Real-Time Dense Surface Mapping and Tracking, Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Andrew Fitzgibbon (ISMAR 2011, Best paper award!)

Workshop Demos
During the demo sessions (held in the middle of the workshop), many of the presenters showed off their SLAM systems in action. Many of these systems are available as open-source (free for non-commercial use?) packages, so if you’re interested in real-time SLAM, downloading the code is worth a shot. However, the one demo which stood out was Andrew Davison’s showcase of his MonoSLAM system from 2004. Andy had to revive his 15-year-old laptop (which was running Red Hat Linux) to show off his original system, running on the original hardware. If the computer vision community is going to one day decide on a “retro-vision” demo session, I’m just going to go ahead and nominate Andy for the best-paper prize, right now.

Andy's Retro-Vision SLAM Setup (Pictured on December 18th, 2015)

It was interesting to watch the SLAM system experts wave their USB cameras around, showing their systems build 3D maps of the desk-sized area around their laptops. If you look carefully at the way these experts move the camera around (i.e., smooth circular motions), you can almost tell how long a person has been working with SLAM. When the non-experts hold the camera, the probability of tracking failure is significantly higher.

I had the pleasure of speaking with Andy during the demo session, and I was curious which line of work (in the past 15 years) surprised him the most. His reply was that PTAM, which showed how to perform real-time bundle adjustment, surprised him the most. The PTAM system was essentially a MonoSLAM++ system, but the significantly improved tracking results were due to taking a heavyweight algorithm (bundle adjustment) and making it real-time — something which Andy did not believe was possible in the early 2000s.

Part III: Deep Learning vs SLAM

The SLAM panel discussion was a lot of fun. Before we jump to the important Deep Learning vs SLAM discussion, I should mention that each of the workshop presenters agreed that semantics are necessary to build bigger and better SLAM systems. There were lots of interesting mini-conversations about future directions. During the debates, Marc Pollefeys (a well-known researcher in SfM and Multiple-View Geometry) reminded everybody that Robotics is the killer application of SLAM and suggested we keep an eye on the prize. This is quite surprising since SLAM was traditionally applied to Robotics problems, but the lack of Robotics success in the last few decades (Google Robotics?) has shifted the focus of SLAM away from Robots and towards large-scale map building (ala Google Maps) and Augmented Reality. Nobody at this workshop talked about Robots.

Integrating semantic information into SLAM
There was a lot of interest in incorporating semantics into today’s top-performing SLAM systems. When it comes to semantics, the SLAM community is unfortunately stuck in the world of bags-of-visual-words, and doesn't have new ideas on how to integrate semantic information into their systems. On the other end, we’re now seeing real-time semantic segmentation demos (based on ConvNets) popping up at CVPR/ICCV/ECCV, and in my opinion SLAM needs Deep Learning as much as the other way around.

Integrating semantics into SLAM is often talked about, but it is easier said than done.
Figure 6.9 (page 142) from Moreno's PhD thesis: Dense Semantic SLAM

"Will end-to-end learning dominate SLAM?"
Towards the end of the SLAM workshop panel, Dr. Zeeshan Zia asked a question which startled the entire room and led to a memorable, energy-filled discussion. You should have seen the look on the panel’s faces. It was a bunch of geometers being thrown a fireball of deep learning. Their facial expressions suggested bewilderment, anger, and disgust. "How dare you question us?" they were thinking. And it is only during these fleeting moments that we can truly appreciate the conference experience. Zia's question was essentially: Will end-to-end learning soon replace the mostly manual labor involved in building today’s SLAM systems?

Zia's question is very important because end-to-end trainable systems have been slowly creeping up on many advanced computer science problems, and there's no reason to believe SLAM will be an exception. A handful of the presenters pointed out that current SLAM systems rely on too much geometry for a pure deep-learning based SLAM system to make sense -- we should use learning to make the point descriptors better, but leave the geometry alone. Just because you can use deep learning to make a calculator, it doesn't mean you should.

Learning Stereo Similarity Functions via ConvNets, by Yann LeCun and collaborators.

While many of the panel speakers responded with a somewhat affirmative "no", it was Newcombe who surprisingly championed what the marriage of Deep Learning and SLAM might look like.

Newcombe's Proposal: Use SLAM to fuel Deep Learning
Although Newcombe didn’t provide much evidence or ideas on how Deep Learning might help SLAM, he provided a clear path on how SLAM might help Deep Learning.  Think of all those maps that we've built using large-scale SLAM and all those correspondences that these systems provide — isn’t that a clear path for building terascale image-image "association" datasets which should be able to help deep learning? The basic idea is that today's SLAM systems are large-scale "correspondence engines" which can be used to generate large-scale datasets, precisely what needs to be fed into a deep ConvNet.

Concluding Remarks
There is quite a large disconnect between the kind of work done at the mainstream ICCV conference (heavy on machine learning) and the kind of work presented at the real-time SLAM workshop (heavy on geometric methods like bundle adjustment). The mainstream Computer Vision community has witnessed several mini-revolutions within the past decade (e.g., Dalal-Triggs, DPM, ImageNet, ConvNets, R-CNN) while the SLAM systems of today don’t look very different than they did 8 years ago. The Kinect sensor has probably been the single largest game changer in SLAM, but the fundamental algorithms remain intact.
Integrating semantic information: The next frontier in Visual SLAM. 
Brain image from Arwen Wallington's blog post.

Today’s SLAM systems help machines geometrically understand the immediate world (i.e., build associations in a local coordinate system) while today’s Deep Learning systems help machines reason categorically (i.e., build associations across distinct object instances). In conclusion, I share Newcombe and Davison's excitement about Visual SLAM, as vision-based algorithms are going to turn Augmented and Virtual Reality into billion dollar markets. However, we should not forget to keep our eyes on the "trillion-dollar" market, the one that's going to redefine what it means to "work" -- namely Robotics. The day of Robot SLAM will come soon.

Tuesday, December 08, 2015

ICCV 2015: Twenty one hottest research papers

"Geometry vs Recognition" becomes ConvNet-for-X

Computer Vision used to be cleanly separated into two schools: geometry and recognition. Geometric methods like structure from motion and optical flow usually focus on measuring objective real-world quantities like 3D "real-world" distances directly from images, while recognition techniques like support vector machines and probabilistic graphical models traditionally focus on perceiving high-level semantic information (i.e., is this a dog or a table) directly from images.

The world of computer vision has changed. We now have powerful convolutional neural networks that are able to extract just about anything directly from images. So if your input is an image (or set of images), then there's probably a ConvNet for your problem. While you do need a large labeled dataset, believe me when I say that collecting a large dataset is much easier than manually tweaking knobs inside your 100K-line codebase. As we're about to see, the separation between geometric methods and learning-based methods is no longer easily discernible.

By 2016 just about everybody in the computer vision community will have tasted the power of ConvNets, so let's take a look at some of the hottest new research directions in computer vision.

ICCV 2015's Twenty One Hottest Research Papers

This December in Santiago, Chile, the International Conference on Computer Vision 2015 is going to bring together the world's leading researchers in Computer Vision, Machine Learning, and Computer Graphics.

To no surprise, this year's ICCV is filled with lots of ConvNets, but this time these Deep Learning tools are being applied to much, much more creative tasks. Let's take a look at the following twenty one ICCV 2015 research papers, which will hopefully give you a taste of where the field is going.

1. Ask Your Neurons: A Neural-Based Approach to Answering Questions About Images Mateusz Malinowski, Marcus Rohrbach, Mario Fritz

"We propose a novel approach based on recurrent neural networks for the challenging task of answering of questions about images. It combines a CNN with a LSTM into an end-to-end architecture that predict answers conditioning on a question and an image."

2. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler

"To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book."

3. Learning to See by Moving Pulkit Agrawal, Joao Carreira, Jitendra Malik

"We show that using the same number of training images, features learnt using egomotion as supervision compare favourably to features learnt using class-label as supervision on the tasks of scene recognition, object recognition, visual odometry and keypoint matching."

4. Local Convolutional Features With Unsupervised Training for Image Retrieval Mattis Paulin, Matthijs Douze, Zaid Harchaoui, Julien Mairal, Florent Perronnin, Cordelia Schmid

"We introduce a deep convolutional architecture that yields patch-level descriptors, as an alternative to the popular SIFT descriptor for image retrieval."

5. Deep Networks for Image Super-Resolution With Sparse Prior Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, Thomas Huang

"We show that a sparse coding model particularly designed for super-resolution can be incarnated as a neural network, and trained in a cascaded structure from end to end."

6. High-for-Low and Low-for-High: Efficient Boundary Detection From Deep Object Features and its Applications to High-Level Vision Gedas Bertasius, Jianbo Shi, Lorenzo Torresani

"In this work we show how to predict boundaries by exploiting object level features from a pretrained object-classification network."

7. A Deep Visual Correspondence Embedding Model for Stereo Matching Costs Zhuoyuan Chen, Xun Sun, Liang Wang, Yinan Yu, Chang Huang

"A novel deep visual correspondence embedding model is trained via Convolutional Neural Network on a large set of stereo images with ground truth disparities. This deep embedding model leverages appearance data to learn visual similarity relationships between corresponding image patches, and explicitly maps intensity values into an embedding feature space to measure pixel dissimilarities."

8. Im2Calories: Towards an Automated Mobile Vision Food Diary Austin Myers, Nick Johnston, Vivek Rathod, Anoop Korattikara, Alex Gorban, Nathan Silberman, Sergio Guadarrama, George Papandreou, Jonathan Huang, Kevin P. Murphy

"We present a system which can recognize the contents of your meal from a single image, and then predict its nutritional contents, such as calories."

9. Unsupervised Visual Representation Learning by Context Prediction Carl Doersch, Abhinav Gupta, Alexei A. Efros

"How can one write an objective function to encourage a representation to capture, for example, objects, if none of the objects are labeled?"

10. Deep Neural Decision Forests Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, Samuel Rota Bulò

"We introduce a stochastic and differentiable decision tree model, which steers the representation learning usually conducted in the initial layers of a (deep) convolutional network."

11. Conditional Random Fields as Recurrent Neural Networks Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, Philip H. S. Torr

"We formulate mean-field approximate inference for the Conditional Random Fields with Gaussian pairwise potentials as Recurrent Neural Networks."

12. Flowing ConvNets for Human Pose Estimation in Videos Tomas Pfister, James Charles, Andrew Zisserman

"We investigate a ConvNet architecture that is able to benefit from temporal context by combining information across the multiple frames using optical flow."

13. Dense Optical Flow Prediction From a Static Image Jacob Walker, Abhinav Gupta, Martial Hebert

"Given a static image, P-CNN predicts the future motion of each and every pixel in the image in terms of optical flow. Our P-CNN model leverages the data in tens of thousands of realistic videos to train our model. Our method relies on absolutely no human labeling and is able to predict motion based on the context of the scene."

14. DeepBox: Learning Objectness With Convolutional Networks Weicheng Kuo, Bharath Hariharan, Jitendra Malik

"Our framework, which we call DeepBox, uses convolutional neural networks (CNNs) to rerank proposals from a bottom-up method."

15. Active Object Localization With Deep Reinforcement Learning Juan C. Caicedo, Svetlana Lazebnik

"This agent learns to deform a bounding box using simple transformation actions, with the goal of determining the most specific location of target objects following top-down reasoning."

16. Predicting Depth, Surface Normals and Semantic Labels With a Common Multi-Scale Convolutional Architecture David Eigen, Rob Fergus

"We address three different computer vision tasks using a single multiscale convolutional network architecture: depth prediction, surface normal estimation, and semantic labeling."

17. HD-CNN: Hierarchical Deep Convolutional Neural Networks for Large Scale Visual Recognition Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, Yizhou Yu

"We introduce hierarchical deep CNNs (HD-CNNs) by embedding deep CNNs into a category hierarchy. An HD-CNN separates easy classes using a coarse category classifier while distinguishing difficult classes using fine category classifiers."

18. FlowNet: Learning Optical Flow With Convolutional Networks Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox

"We construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task."

19. Understanding Deep Features With Computer-Generated Imagery Mathieu Aubry, Bryan C. Russell

"Rendered images are presented to a trained CNN and responses for different layers are studied with respect to the input scene factors."

20. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization Alex Kendall, Matthew Grimes, Roberto Cipolla

"Our system trains a convolutional neural network to regress the 6-DOF camera pose from a single RGB image in an end-to-end manner with no need of additional engineering or graph optimisation."

21. Visual Tracking With Fully Convolutional Networks Lijun Wang, Wanli Ouyang, Xiaogang Wang, Huchuan Lu

"A new approach for general object tracking with fully convolutional neural network."


While some can argue that the great convergence upon ConvNets is making the field less diverse, it is actually making the techniques easier to comprehend. It is easier to "borrow breakthrough thinking" from one research direction when the core computations are cast in the language of ConvNets. Using ConvNets, properly trained (and motivated!) 21-year-old graduate students are actually able to compete on benchmarks, where previously it would take an entire 6-year PhD cycle to compete on a non-trivial benchmark.

See you next week in Chile!

Update (January 13th, 2016)

The following awards were given at ICCV 2015.

Achievement awards

  • PAMI Distinguished Researcher Award (1): Yann LeCun
  • PAMI Distinguished Researcher Award (2): David Lowe
  • PAMI Everingham Prize Winner (1): Andrea Vedaldi for VLFeat
  • PAMI Everingham Prize Winner (2): Daniel Scharstein and Rick Szeliski for the Middlebury Datasets

Paper awards

  • PAMI Helmholtz Prize (1): David Martin, Charles Fowlkes, Doron Tal, and Jitendra Malik for their ICCV 2001 paper "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics".
  • PAMI Helmholtz Prize (2): Serge Belongie, Jitendra Malik, and Jan Puzicha, for their ICCV 2001 paper "Matching Shapes".
  • Marr Prize: Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulò, for "Deep Neural Decision Forests".
  • Marr Prize honorable mention: Saining Xie and Zhuowen Tu for "Holistically-Nested Edge Detection".
For more information about awards, see Sebastian Nowozin's ICCV-day-2 blog post.

I also wrote another ICCV-related blog post (January 13, 2016) about the Future of Real-Time SLAM.

Saturday, November 07, 2015

The Deep Learning Gold Rush of 2015

In the last few decades, we have witnessed major technological innovations such as personal computers and the internet finally reach the mainstream. And with mobile devices and social networks on the rise, we're now more connected than ever. So what's next? When is it coming? And how will it change our lives? Today I'll tell you that the next big advance is well underway and it's being fueled by a recent technique in the field of Artificial Intelligence known as Deep Learning.

The California Gold Rush of 2015 is all about Deep Learning. 
It's everywhere, you just don't know how to look.

All of today's excitement in Artificial Intelligence and Machine Learning stems from ground-breaking results in speech and visual object recognition using Deep Learning[1]. These algorithms are being applied to all sorts of data, and the learned deep neural networks outperform traditional expert systems carefully designed by scientists and engineers. End-to-end learning of deep representations from raw data is now possible due to a handful of well-performing deep learning recipes (ConvNets, Dropout, ReLUs, LSTM, DQN, ImageNet). But if there's one final takeaway that we can extract from decades of machine learning research, it's that for many problems going deep isn't a choice, it's often a requirement.

Most of the apps and services you're already using (AirBnB, Snapchat, Uber, Yelp, LinkedIn, etc.) are quite data-hungry and before you know it, they're all going to go mega-deep. So whether you need to revitalize your data science team with deep learning or you're starting an AI-from-day-one operation, it's pretty clear that everybody is rushing to get some of this Silicon Valley Gold.

From Titans to Gold Miners: Your atypical Gold Rush

Like all great gold rushes, this movement is led by new faces, which are pouring into Silicon Valley in droves. But these aren't your typical unskilled immigrants willing to pick up a hammer, nor your fresh computer science grads with some app-writing skills. The key deep learning players of today (known as the Titans of Deep Learning) are computer science professors and researchers (seldom born in the USA) leaving their academic posts and bringing their students and ideas straight into Silicon Valley.

"Turn on, Tune in, Dropout" -- Timothy Leary

Recently, Google and Facebook announced that their operations are now being powered by Deep Learning [2,3]. And with most Deep Learning Titans representing the tech giants (Yann LeCun at Facebook Research, Geoffrey Hinton at Google, Andrew Ng at Baidu), Deep Learning is likely to become one of the most sought-after tech skills. With Toyota to invest $1 Billion in Robotics and Artificial Intelligence Research (November 6, 2015), the announcement of YC Research (October 7, 2015), and the new Google Brain Residency Program "Pre-doc" AI jobs (October 26, 2015), Silicon Valley just got a whole lot more interesting.

Silicon Valley re-defines itself, yet again 

To understand why it took so long for Deep Learning to take off, let's take a brief look at the key technologies which defined Silicon Valley over the last 50 years.  The following timeline gives an overview of where Silicon Valley has been and where it's going.

1970s: Semiconductors 
The story of the digital era starts with semiconductors. "Silicon" in "Silicon Valley" originally referred to the silicon chip or integrated circuit innovations as well as the location (close to Stanford) of much tech-related activity. The dominant firm from that time period was Fairchild Semiconductor International and it eventually gave rise to more recognizable companies like Intel. For a more detailed discussion of this birthing era, take a look at Steve Blank's Secret History of Silicon Valley[4].
Read more about Fairchild at TechCrunch's First Trillion-Dollar Startup 

1980s: Personal Computers
Initially computers were quite large and used solely by research labs, government, and big businesses. But it was the personal computer which turned computer programming from a hobby into a vital skill. You no longer needed to be an MIT student to program on one of these bad boys. Microsoft and Apple were founded in 1975 and 1976, respectively, and they persevered due to their pioneering work in graphical user interfaces. This was the birth of the modern user-friendly Operating System. IBM approached Microsoft in 1980, regarding its upcoming personal computer, and from then on Microsoft would be King for a very long time.

See Mac-history's article on Microsoft's relationship with Apple

1990s: Internet
While the nerds at universities were posting ASCII messages on newsgroups, service providers in the 1990s like AOL helped make the internet accessible to everyone. Remember getting all those AOL disks in the mail? Buying a chunk of digital real estate (your own domain name) became possible and anybody with a dial-up connection and some primitive text/HTML skills could start posting online content. With a mission statement like "organize the world's information", it was eventually Google that got the most out of the late 90s dot-com bubble, and remains a very strong player in all things tech.

2000s: Mobile and Social
While the dot-com bubble was about creating an online presence for startups and established companies, the way we use the internet has dramatically changed since 2001. A ton of new social communities have emerged, and due to Facebook we're now stars in our own reality show. Social and advertising have essentially turned the modern internet into a mainstream TV-like experience. The internet is no longer only for the nerds. The kings of this era (Google and Facebook) are also the biggest players in the Deep Learning space, because they have the largest user bases and in-house apps which can benefit most from machine learning.

2010-2015: Deep Learning comes to the party
Spend more than a day in Silicon Valley and you'll hear the popular expression, "Software is eating the world." Rampant spreading of software was only possible once the internet (1990s) AND mobile devices (2000s) became essential parts of our lives. No longer do we physically mail floppy disks, and social media fuels any app that goes viral. What traditional software is missing (or has been missing up until now) is the ability to improve over time from everyday use. If that same software is able to connect to a large Deep Learning system and start improving, then we have a game-changer on our hands. This is already happening with online advertising, digital assistants like Siri, and smart auto-responders like Google's new email auto-reply feature.

The hierarchical award-winning "AlexNet" Deep Learning architecture 
Visualized using MIT's Toolbox for Deep Learning Neuron Visualization

Massive hiring of deep learning experts by the leading tech companies has only begun, but we also should be on the lookout for new ventures built on top of Deep Learning, not just a revitalization of last decade's successes. On this front, keep a close eye on the following Deep Learning Cloud Service upstarts: Richard Socher from MetaMind, Matthew Zeiler from Clarifai, and Carlos Guestrin from Dato.

2015-2020: Deep Learning Revitalizes Robotics
Recently it has been shown that Deep Learning can be used to help robots learn tasks involving movement, object manipulation, and decision making[6,7,8,9]. Before Deep Learning, lots of different pieces of robotic software and hardware would have to be developed independently and then hacked together for demo day. Today, you can use one of a handful of "Deep Learning for Robotics recipes" and start watching your robot learn the task you care about.

Robot Learns to Grasp using Deep Learning at Carnegie Mellon University. 

With their 2013 acquisition of Boston Dynamics (a hardware play), 2014 acquisition of DeepMind (a software play), and a serious autonomous car play, Google is definitely early to the Robotics party. But the noteworthy bits are happening at the intersection of deep learning and robotics.  I suggest taking a closer look at the Robotics research of Pieter Abbeel of Berkeley, Abhinav Gupta of Carnegie Mellon, and Ashutosh Saxena of Stanford -- all likely stars in the next Deep Learning for Robotics race. As long as Rodney Brooks keeps creating innovative Robotics platforms like Baxter, my expectations for Robotics are off the charts.


Unlike in 1849, the Deep Learning Gold Rush of 2015 is not going to bring some 300,000 gold-seekers in boats to California's mainland. This isn't a bring-your-own-hammer kind of game -- the Titans have already descended from their Ivory Towers and handed us ample mining tools. But it won't hurt to gain some experience with traditional "shallow" machine learning techniques so you can appreciate the power of Deep Learning.

I hope you enjoyed today's read and have a better sense of how Silicon Valley is undergoing a transformation. And remember, today's wave of Deep Learning upstart CEOs have PhDs, but once Deep Learning software becomes more user-friendly (TensorFlow?), maybe you won't have to wait so long to dropout.


[1] Krizhevsky, A., Sutskever, I. and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS 2012.
[2] D'Onfro, J. Google is 're-thinking' all of its products to include machine learning. Business Insider. October 22, 2015.
[3] D'Onfro, J. How Facebook will use artificial intelligence to organize insane amounts of data into the perfect News Feed and a personal assistant with superpowers. Business Insider. November 3, 2015.
[4] Blank, S. Secret History of Silicon Valley. 2008.
[5] Donglai Wei, Bolei Zhou, Antonio Torralba, William T. Freeman. mNeuron: A Matlab Plugin to Visualize Neurons from Deep Models. 2015.
[6] Lerrel Pinto, Abhinav Gupta. Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours. arXiv. 2015.
[7] Sergey Levine, Chelsea Finn, Trevor Darrell, Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. In RSS 2015.
[8] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
[9] Ian Lenz, Ross Knepper, and Ashutosh Saxena. DeepMPC: Learning Deep Latent Features for Model Predictive Control. In Robotics: Science and Systems (RSS), 2015.