Tombone's Computer Vision Blog

Tuesday, November 19, 2019

Computer Vision and Visual SLAM vs. AI Agents

With all the recent advancements in end-to-end deep learning, it is now possible to train AI agents to perform many different tasks (some in simulation and some in the real-world). End-to-end learning allows one to replace a multi-component, hand-engineered system with a single learning network that can process raw sensor data and output actions for the AI to take in the physical world. I will discuss the implications of these ideas while highlighting some new research trends regarding Deep Learning for Visual SLAM and conclude with some predictions regarding the kinds of spatial reasoning algorithms that we will need in the future.

In today's article, we will go over three ideas:

I.) Does Computer Vision Matter for Action?

II.) Visual SLAM for AI agents

III.) Quō vādis Visual SLAM? Trends and research forecast

I. Does Computer Vision Matter for Action?

At last month's International Conference of Computer Vision (ICCV 2019), I heard the following thought-provoking question,

"What do Artificial Intelligence Agents need (if anything) from the field of Computer Vision?"

The question was posed by Vladlen Koltun (from Intel Research) during his talk at the Deep Learning for Visual SLAM Workshop at ICCV 2019 in Seoul. He spoke about building AI agents with and without the aid of computer vision to guide representation learning. While Koltun has worked on classical Visual SLAM (see his Direct Sparse Odometry (DSO) system [2]), at this workshop, he decided to not speak about his older work on geometry, alignment, or 3D point cloud processing. His talk included numerous ideas spanning several of his team's research papers, some humor (see video), and plenty of Koltun's philosophical views towards general artificial intelligence.

Recent techniques show that it is possible to learn actions (the output quantities that we really want) from pixels (raw inputs) directly without any intermediate computer vision processing like object recognition, depth estimation, and segmentation. But just because it is possible to solve some AI tasks without intermediate representations (i.e., the computer vision stuff), does that mean that we should abandon computer vision research and let end-to-end learning take care of everything? Probably not.

From a very practical standpoint, let's ask the following question:

"Is an agent who is aware of computer vision stuff more robust than an agent trained without intermediate representations?"

Recent research from Koltun's lab [3] indicates that the answer is yes: training with intermediate representations, as done by supervision from per-frame computer vision tasks, gives rise to more robust agents that learn faster and are more robust in a variety of performance tasks! The next natural question is: which computer vision tasks matter most for agent robustness? Koltun's research suggests that depth estimation is one particular task that works well as an auxiliary task when training agents that have to move through space (i.e., most video games). A depth estimation network should help an AI agent navigate an unknown environment as depth estimation is one key component in many of today's RGBD Visual SLAM systems. The best way to learn about Koltun's paper, titled Does Computer Vision Matter for Action?, is to see the video on YouTube.

Video describing Koltun's Does Computer Vision Matter for Action? [3]

Let's imagine that you want to deploy a robot into the world sometimes from now until 2025 based on your large-scale AI agent training, and you're debating whether you should avoid intermediate representations or not.

Intermediate representations facilitate explainability, debuggability, and testing. Explainability is a key to success when systems require spatial reasoning capabilities in the real-world. If your agents are misbehaving, take a look at their intermediate representations. If you want to improve your AI, you can analyze the computer vision systems to prioritize better your data collection effort. Visualization should be a first-order citizen in your deep learning toolbox.

But today's computer vision ecosystem offers more than algorithms that process individual images. Visual SLAM systems rapidly process images while updating the camera's trajectory and updating the 3D map of the world. Visual SLAM, or VSLAM, algorithms are the real-time variants of Structure-from-Motion (SfM), which has been around for a while. SfM uses bundle adjustment -- a minimization of reprojection error, usually solved with Levenberg Marquardt. If there any kind of robot you see moving around today (2019), it is likely that it is running some variant of SLAM (localization and mapping) and not an end-to-end trained network -- at least not today. So what does Visual SLAM mean for AI agents?

II. Visual SLAM for AI Agents

While no single per-frame computer vision algorithm is close to sufficient to enable robust action in an environment, there is a class of real-time computer vision systems like Visual SLAM that can be used to guide agents through space. The Workshop on Deep Learning for Visual SLAM at ICCV 2019 showcased a variety of different Visual SLAM approaches and included a discussion panel. The workshop featured talks on Visual SLAM on mobile platforms (Victor Prisacariu from 6d.ai), autonomous cars (Daniel Cremers from TUM and ArtiSense.ai), high-detail indoor modeling (Angela Dai from TUM), AI Agents (Vladlen Koltun from Intel Research) and mixed-reality (Tomasz Malisiewicz from Magic Leap).

Teaser Image for the 2nd Workshop on Deep Learning for Visual SLAM from Ronnie Clark.
See info at http://visualslam.ai

When it comes to spatial perception capabilities, Koltun's talk made it clear that we, as computer vision researchers, could think bolder. There is a spectrum of spatial perception capabilities that AI agents need that only somewhat overlaps with traditional Visual SLAM (whether deep learning-based or not).

Koltun's work is in favor of using intermediate representations based on computer vision to produce more robust AI agents. However, Koltun is not convinced that 6dof Visual SLAM, as is currently defined, needs to be solved for AI agents. Let's consider ordinary human tasks like walking, washing your hands, and flossing your teeth -- each one requires a different amount of spatial reasoning abilities. It is reasonable to assume that AI agents would need varying degrees of spatial localization and mapping capabilities to perform such tasks.

Visual SLAM techniques, like the ones used inside Augmented Reality systems, build metric 3D maps of the environment for the task of high-precision placement of digital content -- but such high-precision systems might never be used directly inside AI agents. When the camera is hand-held (augmented reality) or head-mounted (mixed reality), a human decides where to move. AI agents have to make their own movement decisions, and this requires more than feature correspondences and bundle adjustment -- more than what is inside the scope of computer vision.

Inside a head-mounted display, you might look at digital content 30 feet away from you, and for everything to look correct geometrically, you must have a decent 3D map of the world (spanning at least 30 feet) and a reasonable estimate of your pose. But for many tasks that AI agents need to perform, metric-level representations of far-away geometry are un-necessary. It is as if proper action requires local, high-quality metric maps and something coarser like topological maps for large-range maps. Visual SLAM systems (stereo-based and depth-sensor based) are likely to find numerous applications in industry such as mixed reality and some branches of robotics, where millimeter precision matters.

More general end-to-end learning for AI agents will show us new kinds of spatial intelligence, automatically learned from data. There is a lot of exciting research to be done to answer questions like the following: What kind of tasks can we train Visual AI Agents for such that map-building and localization capabilities arise? Or What type of core spatial reasoning capabilities can we pre-build to enable further self-supervised learning from the 3D world?

III. Quō vādis Visual SLAM? Trends and research forecast

At the Deep Learning Workshop for Visual SLAM, an interesting question that came up in the panel focused on the convergence of methods in Visual SLAM. Or alternatively,

"Will a single Visual SLAM framework rule them all?"

The world of applied research is moving towards more deep learning -- by 2019, many of the critical tasks inside computer vision exist as some form of a (convolutional/graph) neural network. I don't believe that we will see a single SLAM framework/paradigm dominate all others -- I think we will see a plurality of Visual SLAM systems based on inter-changeable deep learning components. This new generation of deep learning-based components will allow more creative applications of end-to-end learning and be typically useful as modules within other real-world systems. We should create tools that will enable others to make better tools.

PyTorch is making it easy to build multiple-view geometry tools like Kornia -- such that the right parts of computer vision are brought directly into today's deep learning ecosystem as first-order citizens. And PyTorch is winning over the world of research. A dramatic increase in usage happened from 2017 to 2019, with PyTorch now the recommended framework amongst most of my fellow researchers.

To take a look at what the end goal in terms of end-to-end deep learning for visual SLAM might look like, take a look at gradSLAM from Krishna Murthy, a Ph.D. student in MILA, and collaborators at CMU. Their paper offers a new way of thinking of SLAM as made up of differentiable blocks. From the article, "This amalgamation of dense SLAM with computational graphs enables us to backprop from 3D maps to 2D pixels, opening up new possibilities in gradient-based learning for SLAM."

Key Figure from the gradSLAM paper on end-to-end learning for SLAM. [5]

Another key trend that seems to be on the rise inside the context of Deep Visual SLAM is self-supervised learning. We are seeing more and more practical successes of self-supervised learning for multi-view problems where geometry enables us to get away from strong supervision. Even the ConvNet-based point detector SuperPoint [7], which my team and I developed at Magic Leap, uses self-supervision to train more robust interest point detectors. In our case, it was impossible to get ground truth interest points on images, and self-labeling was the only way out. One of my favorite researchers working on self-supervised techniques is Adrien Gaidon from TRI, who studies how such methods can be used to make smarter cars. Adrien gave some great talks at other ICCV 2019 Workshops related to autonomous vehicles, and his work is closely related to Visual SLAM and useful for anybody working on similar problems.

Adrien Gaidon's talk from October 11th, 2019 on Self-Supervised Learning in the context of Autonomous Cars

Another excellent presentation about this topic from Alyosha Efros. He does a great job convincing you why you should love self-supervision.

A presentation about self-supervision from Alyosha Efros on May 25th, 2018

Conclusion

As more and more spatial reasoning skills get baked into deep networks, we must face two opposing forces. On the one hand, specifying internal representations makes it difficult to scale to new tasks -- it is easier to trick the deep nets into doing all the hard work for you. On the other hand, we want interpretability and some amount of safety when we deploy AI agents into the real world, so some intermediate tasks like object recognition are likely to be involved in today's spatial perception recipe. Lots of exciting work is happening with multi-agents from OpenAI [6], but full end-to-end learning will not give real-world robots such as autonomous cars anytime soon.

Video from OpenAI showing Multi-Agent Hide and Seek. [6]

More practical Visual SLAM research will focus on differentiable high-level blocks. As more deep learning happens in Visual SLAM, it will create a renaissance in Visual SLAM as sharing entire SLAM systems will be as easy as sharing CNNs today. I cannot wait until the following is possible:

pip install DeepSLAM

I hope you enjoyed learning about the different approaches to Visual SLAM, and that you have found my blog post insightful and educational. Until next time!

References:

[1]. Vladlen Koltun. Chief Scientist for Intelligent Systems at Intel. http://vladlen.info/

[2]. Direct Sparse Odometry. Jakob Engel, Vladlen Koltun, and Daniel Cremers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3), 2018. http://vladlen.info/publications/direct-sparse-odometry/

[3]. Does Computer Vision Matter for Action? Brady Zhou, Philipp Krähenbühl, and Vladlen Koltun. Science Robotics, 4(30), 2019. http://vladlen.info/publications/computer-vision-matter-action/

[4]. Kornia: an Open Source Differentiable Computer Vision Library for PyTorch. Edgar Riba, Dmytro Mishkin, Daniel Ponsa, Ethan Rublee, and Gary Bradski. Winter Conference on Applications of Computer Vision, 2019. https://kornia.github.io/

[5]. gradSLAM: Dense SLAM meets Automatic Differentiation. Krishna Murthy J., Ganesh Iyer, and Liam Paull. In arXiv, 2019. http://montrealrobotics.ca/gradSLAM/

[6] Emergent tool use from multi-agent autocurricula. Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. In arXiv 2019. https://openai.com/blog/emergent-tool-use/
[7] SuperPoint: Self-supervised interest point detection and description. Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018. https://arxiv.org/abs/1712.07629

Wednesday, May 16, 2018

DeepFakes: AI-powered deception machines

Driven by computer vision and deep learning techniques, a new wave of imaging attacks has recently emerged which allows anyone to easily create highly realistic "fake" videos. These false videos are known as Deep Fakes. While highly entertaining at times, DeepFakes can be used to perturb society and some would argue that the pre-shock has already begun. A rogue DeepFake which goes viral can spread misinformation across the internet like wildfire.

"The ability to effortlessly create visually plausible editing of faces in videos has the potential to severely undermine trust in any form of digital communication. "

--Rössler et al. FaceForensics [3]

Because DeepFakes contain a unique combination of realism and novelty, they are more difficult to detect on social networks as compared to traditional "bad" content like pornography and copyrighted movies. Video hashing might work for finding duplicates or copyright-infringing content, but not good enough for DeepFakes. To fight face-manipulating DeepFake AI, one needs an even stronger AI.

As today's DeepFakes are based on Deep Learning, and Deep Learning tools like TensorFlow and PyTorch are accessible to anybody with a modern GPU, such face manipulation tools are particularly disruptive. The democratization of Artificial Intelligence has brought us near infinite use-cases. From the DeepDream phenomenon of 2015 to the Deep Style Transfer Art apps of 2016, 2018 is the year of the DeepFake. Today's computer vision technology allows a hobbyist to create a Deep Fake video of just about any person they want performing any action they want, in a matter of hours, using commodity computer hardware.

Fig 1. DeepFakes generate "false impressions" which are attacks on the human mind.

What is a Deep Fake?
A deep fake is a video generated from a modern computer vision puppeteering face-swap algorithm which can be used to generate a video of target person X performing target action A, usually given a video of another person Y performing action A. The underlying system learns two face models, one of target person X, and of for person Y, the person in the original video. It then learns a mapping between the two faces, which can be used to create the resulting "fake" video. Techniques for facial reenactment have been pioneered by movie studios for driving character animations from real actors' faces, but these techniques are now emerging as deep learning-based software packages, letting the deep convolutional neural networks do most of the work during model training.

Consider the following collage of faces. Can you guess which ones are real and which ones are DeepFakes?

Fig 2. Can you tell which faces are real and which ones are fake?

Figure from Face Forensics[3]

It is not so easy to tell which image is modified and which one is unadulterated. And if you do a little bit of searching for DeepFakes (warning, unless you are careful, you will encounter lots of pornographic content) you notice that the faces in those videos look very realistic.

How are Deep Fakes made?
While there are conceptually many different ways to make Deep Fakes, today we'll focus on two key underlying techniques: face detection from videos, and deep learning for creating frame alignments between source face X and target face Y.

A lot of this research started with the Face2face work [1] presented at CVPR 2016. This paper was a modernization of the group's earlier SIGGRAPH paper and focused a lot more on the computer vision details. At this time the tools were good enough to create SIGGRAPH-quality videos, but it took a lot of work to put together a facial reenactment rig. In addition, the underlying algorithms did not use any deep learning, so a lot of domain-knowledge (i.e., face modeling expertise) went into making these algorithms work robustly. The TUM/Stanford guys filed their Real-time facial reenactment patent in 2016 [4], and have more recently worked on FaceForensics[3] to detect such manipulated imagery.

Fig3. Face2Face technique from 2016. It is 2018 now, so just imagine how much better this works now!

In addition to the Face2face guys (who have now a handful of similarly themed papers), it is interesting to note that a lot of key early ideas in face puppeteering were pioneered by Ira Kemelmacher-Shlizerman who is now a computer vision and graphics assistant professor at University of Washington. She worked on early face puppeteering technology for the 2010 paper Being John Malkovich, continued with the Photobios work, and later founded Dreambit (based on a SIGGRAPH 2016 paper), which was acquired by Facebook. :-)

Fig 4. Ira's early work on face swapping in 2010. See the Being John Malkovich paper[2].

Take a look at Ira's Dreambit video, which shows some high-quality "entertainment" value out of rapidly produced non-malicious DeepFakes!

Fig 5. Ira's Dreambit system. Lets her imagine herself in different eras, with different hairstyles, etc.

The origin of Ira's Dreambit system is the Transfiguring Portraits SIGGRAPH 2016 paper[6]. What's important to note is that this is 2016 and we're starting to see some use of Deep Learning. The transfiguring portraits work used a big mix of features, using some CNN features computed from early Caffe networks. It is not an entirely easy-to-use system at this point, but good enough to make SIGGRAPH videos, take a one minute to generate other cool outputs, and definitely cool enough for Facebook to acquire.

Fig 6. Transfiguring Portraits. The system used lots of features, but Deep Learning-based CNN features are starting to show up.

Fighting against DeepFakes
There are now published algorithms which try to battle DeepFakes by determining if faces/videos are fake or not. FaceForensics[3] introduces a large DeepFake dataset based on their earlier Face2face work. This dataset contains both real and "fake" Face2face output videos. More importantly, the new dataset is big enough to train a deep learning system to determine if an image is counterfeit. In addition, they are able to both 1.) determine which pixels have likely been manipulated, and 2.) perform a deep cleanup stage to make even better DeepFakes.

Fig 7. The "fakeness" masks in FaceForensics[3] are based on XceptionNet

Another fake detection approach, this time from a Berkeley AI Research group called Image Splice Detection, focuses on detecting where an image was spliced to create a fake composite image. This allows them to determine which part of the image was likely "photoshopped" and the technique is not specific to faces. And because this is a 2018 paper, it should not be a surprise that this kind of work is all based on deep learning techniques.

Fig 8. Fighting Fake News: Image Splice Detection[5]. Response maps are aggregated to determine the combined probability mask.[5]

From the Fighting Fake News paper,

"As new advances in computer vision and image-editing emerge, there is an increasingly urgent need for effective visual forensics methods. We see our approach, which successfully detects manipulations without seeing examples of manipulated images, as being an initial step toward building general-purpose forensics tools."

Concluding Remarks
The early DeepFake tools were pioneered in the early 2010s and were producing SIGGRAPH-quality results by 2015. It was only a matter of years until DeepFake generators became publicly available. 2018's DeepFake generators, being written on top of open-source Deep Learning libraries, are much easier to use than the researchy systems from only a few years back. Today, just about any hobbyist with minimal computer programming knowledge and a GPU can build their own DeepFakes.

Just as Deep Fakes are getting better, Generative Adversarial Networks are showing more promise for photorealistic image generation. It is likely that we will soon see lots of exciting new work on both the generative side (deep fake generation) and the discriminative side (deep fake detection and image forensics) which incorporate more and more ideas from the machine learning community.

References

[1] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. "Face2face: Real-time face capture and reenactment of rgb videos." In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pp. 2387-2395. IEEE, 2016.

[2] Ira Kemelmacher-Shlizerman, Aditya Sankar, Eli Shechtman, and Steven M. Seitz. "Being john malkovich." In European Conference on Computer Vision, pp. 341-353. Springer, 2010.

[3] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. "FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human Faces." arXiv preprint arXiv:1803.09179, 2018.

[4] Christian Theobalt, Michael Zollhöfer, Marc Stamminger, Justus Thies, Matthias Nießner. Real-time Expression Transfer for Facial Reenactment Invention. 2018/3/8. Application Number 15256710

[5] Minyoung Huh, Andrew Liu, Andrew Owens, Alexei A. Efros, "Fighting Fake News: Image Splice Detection via Learned Self-Consistency." arXiv preprint arXiv:1805.04096, 2018

[6] Ira Kemelmacher-Shlizerman, "Transfiguring portraits." ACM Transactions on Graphics (TOG), 35(4), p.94. 2016

Friday, December 16, 2016

Nuts and Bolts of Building Deep Learning Applications: Ng @ NIPS2016

You might go to a cutting-edge machine learning research conference like NIPS hoping to find some mathematical insight that will help you take your deep learning system's performance to the next level. Unfortunately, as Andrew Ng reiterated to a live crowd of 1,000+ attendees this past Monday, there is no secret AI equation that will let you escape your machine learning woes. All you need is some rigor, and much of what Ng covered is his remarkable NIPS 2016 presentation titled "The Nuts and Bolts of Building Applications using Deep Learning" is not rocket science. Today we'll dissect the lecture and Ng's key takeaways. Let's begin.

Figure 1. Andrew Ng delivers a powerful message at NIPS 2016.

Andrew Ng and the Lecture

Andrew Ng's lecture at NIPS 2016 in Barcelona was phenomenal -- truly one of the best presentations I have seen in a long time. In a juxtaposition of two influential presentation styles, the CEO-style and the Professor-style, Andrew Ng mesmerized the audience for two hours. Andrew Ng's wisdom from managing large scale AI projects at Baidu, Google, and Stanford really shows. In his talk, Ng spoke to the audience and discussed one of they key challenges facing most of the NIPS audience -- how do you make your deep learning systems better? Rather than showing off new research findings from his cutting-edge projects, Andrew Ng presented a simple recipe for analyzing and debugging today's large scale systems. With no need for equations, a handful of diagrams, and several checklists, Andrew Ng delivered a two-whiteboards-in-front-of-a-video-camera lecture, something you would expect at a group research meeting. However, Ng made sure to not delve into Research-y areas, likely to make your brain fire on all cylinders, but making you and your company very little dollars in the foreseeable future.

Money-making deep learning vs Idea-generating deep learning

Andrew Ng highlighted the fact that while NIPS is a research conference, many of the newly generated ideas are simply ideas, not yet battle-tested vehicles for converting mathematical acumen into dollars. The bread and butter of money-making deep learning is supervised learning with recurrent neural networks such as LSTMs in second place. Research areas such as Generative Adversarial Networks (GANs), Deep Reinforcement Learning (Deep RL), and just about anything branding itself as unsupervised learning, are simply Research, with a capital R. These ideas are likely to influence the next 10 years of Deep Learning research, so it is wise to focus on publishing and tinkering if you really love such open-ended Research endeavours. Applied deep learning research is much more about taming your problem (understanding the inputs and outputs), casting the problem as a supervised learning problem, and hammering it with ample data and ample experiments.

"It takes surprisingly long time to grok bias and variance deeply, but people that understand bias and variance deeply are often able to drive very rapid progress."

--Andrew Ng

The 5-step method of building better systems

Most issues in applied deep learning come from a training-data / testing-data mismatch. In some scenarios this issue just doesn't come up, but you'd be surprised how often applied machine learning projects use training data (which is easy to collect and annotate) that is different from the target application. Andrew Ng's discussion is centered around the basic idea of bias-variance tradeoff. You want a classifier with a good ability to fit the data (low bias is good) that also generalizes to unseen examples (low variance is good). Too often, applied machine learning projects running as scale forget this critical dichotomy. Here are the four numbers you should always report:

Training set error
Testing set error
Dev (aka Validation) set error
Train-Dev (aka Train-Val) set error

Andrew Ng suggests following the following recipe:

Figure 2. Andrew Ng's "Applied Bias-Variance for Deep Learning Flowchart"

for building better deep learning systems.

Take all of your data, split it into 60% for training and 40% for testing. Use half of the test set for evaluation purposes only, and the other half for development (aka validation). Now take the training set, leave out a little chunk, and call it the training-dev data. This 4-way split isn't always necessary, but consider the worse case where you start with two separate sets of data, and not just one: a large set of training data and a smaller set of test data. You'll still want to split the testing into validation and testing, but also consider leaving out a small chunk of the training data for the training-validation. By reporting the data on the training set vs the training-validation set, you measure the "variance."

Figure 3. Human-level vs Training vs Training-dev vs Dev vs Test.

Taken from Andrew Ng's 2016 talk.

In addition to these four accuracies, you might want to report the human-level accuracy, for a total of 5 quantities to report. The difference between human-level and training set performance is the Bias. The difference between the training set and the training-dev set is the Variance. The difference between the training-dev and dev sets is the train-test mismatch, which is much more common in real-world applications that you'd think. And finally, the difference between the dev and test sets measures how overfitting.

Nowhere in Andrew Ng's presentation does he mention how to use unsupervised learning, but he does include a brief discussion about "Synthesis." Such synthesis ideas are all about blending pre-existing data or using a rendering engine to augment your training set.

Conclusion
If you want to lose weight, gain muscle, and improve your overall physical appearance, there is no magical protein shake and no magical bicep-building exercise. The fundamentals such as reduced caloric intake, getting adequate sleep, cardiovascular exercise, and core strength exercises like squats and bench presses will get you there. In this sense, fitness is just like machine learning -- there is no secret sauce. I guess that makes Andrew Ng the Arnold Schwarzenegger of Machine Learning.

What you are most likely missing in your life is the rigor of reporting a handful of useful numbers such as performance on the 4 main data splits (see Figure 3). Analyzing these numbers will let you know if you need more data or better models, and will ultimately let you hone in your expertise on the conceptual bottleneck in your system (see Figure 2).

With a prolific research track record that never ceases to amaze, we all know Andrew Ng as one hell of an applied machine learning researcher. But the new Andrew Ng is not just another data-nerd. His personality is bigger than ever -- more confident, more entertaining, and his experience with a large number of academic and industrial projects makes him much wiser. With enlightening lectures as "The Nuts and Bolts of Building Applications with Deep Learning" Andrew Ng is likely to be an individual whose future keynotes you might not want to miss.

Appendix
You can watch a September 27th, 2016 version of the Andrew Ng Nuts and Bolts of Applying Deep Learning Lecture on YouTube, which he delivered at the Deep Learning School. If you are working on machine learning problems in a startup, then definitely give the video a watch. I will update the video link once/if the newer NIPS 2016 version shows up online.

You can also check out Kevin Zakka's blog post for ample illustrations and writeup corresponding to Andrew Ng's entire talk.

Friday, June 17, 2016

Making Deep Networks Probabilistic via Test-time Dropout

In Quantum Mechanics, Heisenberg's Uncertainty Principle states that there is a fundamental limit to how well one can measure a particle's position and momentum. In the context of machine learning systems, a similar principle has emerged, but relating interpretability and performance. By using a manually wired or shallow machine learning model, you'll have no problem understanding the moving pieces, but you will seldom be happy with the results. Or you can use a black-box deep neural network and enjoy the model's exceptional performance. Today we'll see one simple and effective trick to make our deep black boxes a bit more intelligible. The trick allows us to convert neural network outputs into probabilities, with no cost to performance, and minimal computational overhead.

Interpretability vs Performance: Deep Neural Networks perform well on most computer vision tasks, yet they are notoriously difficult to interpret.

The desire to understand deep neural networks has triggered a flurry of research into Neural Network Visualization, but in practice we are often forced to treat deep learning systems as black-boxes. (See my recent Deep Learning Trends @ ICLR 2016 post for an overview of recent neural network visualization techniques.) But just because we can't grok the inner-workings of our favorite deep models, it doesn't mean we can't ask more out of our deep learning systems.

There exists a simple trick for upgrading black-box neural network outputs into probability distributions.

The probabilistic approach provides confidences, or "uncertainty" measures, alongside predictions and can make almost any deep learning systems into a smarter one. For robotic applications or any kind of software that must make decisions based on the output of a deep learning system, being able to provide meaningful uncertainties is a true game-changer.

Applying Dropout to your Deep Neural Network is like occasionally zapping your brain

The key ingredient is dropout, an anti-overfitting deep learning trick handed down from Hinton himself (Krizhevsky's pioneering 2012 paper). Dropout sets some of the weights to zero during training, reducing feature co-adaptation, thus improving generalization.

Without dropout, it is too easy to make a moderately deep network attain 100% accuracy on the training set.

The accepted knowledge is that an un-regularized network (one without dropout) is too good at memorizing the training set. For a great introductory machine learning video lecture on dropout, I highly recommend you watch Hugo Larochelle's lecture on Dropout for Deep learning.

Geoff Hinton's dropout lecture, also a great introduction, focuses on interpreting dropout as an ensemble method. If you're looking for new research ideas in the dropout space, a thorough understanding of Hinton's interpretation is a must.

But while dropout is typically used at training-time, today we'll highlight the keen observation that dropout used at test-time is one of the simplest ways to turn raw neural network outputs into probability distributions. Not only does this probabilistic "free upgrade" often improve classification results, it provides a meaningful notion of uncertainty, something typically missing in Deep Learning systems.

The idea is quite simple: to estimate the predictive mean and predictive uncertainty, simply collect the results of stochastic forward passes through the model using dropout.

How to use dropout: 2016 edition

Start with a moderately sized network
Increase your network size with dropout turned off until you perfectly fit your data
Then, train with dropout turned on
At test-time, turn on dropout and run the network T times to get T samples
The mean of the samples is your output and the variance is your measure of uncertainty

Remember that drawing more samples will increase computation time during testing unless you're clever about re-using partial computations in the network. Please note that if you're only using dropout near the end of your network, you can reuse most of the computations. If you're not happy with the uncertainty estimates, consider adding more layers of dropout at test-time. Since you'll already have a pre-trained network, experimenting with test-time dropout layers is easy.

Bayesian Convolutional Neural Networks

To be truly Bayesian about a deep network's parameters, we wouldn't learn a single set of parameters w, we would infer a distribution over weights given the data, p(w|X,Y). Training is already quite expensive, requiring large datasets and expensive GPUs.

Bayesian learning algorithms can in theory provide much better parameter estimates for ConvNets and I'm sure some of our friends at Google are working on this already.

But today we aren't going to talk about such full Bayesian Deep Learning systems, only systems that "upgrade" the model prediction y to p(y|x,w). In other words, only the network outputs gain a probabilistic interpretation.

An excellent deep learning computer vision system which uses test-time dropout comes from a recent University of Cambridge technique called SegNet. The SegNet approach introduced an Encoder-Decoder framework for dense semantic segmentation. More recently, SegNet includes a Bayesian extension that uses dropout at test-time for providing uncertainty estimates. Because the system provides a dense per-pixel labeling, the confidences can be visualized as per-pixel heatmaps. Segmentation system is not performing well? Just look at the confidence heatmaps!

Bayesian SegNet. A fully convolutional neural network architecture which provides

per-pixel class uncertainty estimates using dropout.

The Bayesian SegNet authors tested different strategies for dropout placement and determined that a handful of dropout layers near the encoder-decoder bottleneck is better than simply using dropout near the output layer. Interestingly, Bayesian SegNet improves the accuracy over vanilla SegNet. Their confidence maps shown high uncertainty near object boundaries, but different test-time dropout schemes could provide a more diverse set of uncertainty estimates.

Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding Alex Kendall, Vijay Badrinarayanan, Roberto Cipolla, in arXiv:1511.02680, November 2015. [project page with videos]

Confidences are quite useful for evaluation purposes, because instead of providing a single average result across all pixels in all images, we can sort the pixels and/or images by the overall confidence in prediction. When evaluation the top 10% most confident pixels, we should expect significantly higher performance. For example, the Bayesian SegNet approach achieves 75.4% global accuracy on the SUN RGBD dataset, and an astonishing 97.6% on most confident 10% of the test-set [personal communication with Bayesian SegNet authors]. This kind of sort-by-confidence evaluation was popularized by the PASCAL VOC Object Detection Challenge, where precision/recall curves were the norm. Unfortunately, as the research community moved towards large-scale classification, the notion of confidence was pushed aside. Until now.

Theoretical Bayesian Deep Learning

Deep networks that model uncertainty are truly meaningful machine learning systems. It ends up that we don't really have to understand how a deep network's neurons process image features to trust the system to make decisions. As long as the model provides uncertainty estimates, we'll know when the model is struggling. This is particularly important when your network is given inputs that are far from the training data.

The Gaussian Process: A machine learning approach with built-in uncertainty modeling

In a recent ICML 2016 paper, Yarin Gal and Zoubin Ghahramani develop a new theoretical framework casting dropout training in deep neural networks as approximate Bayesian inference in deep Gaussian processes. Gal's paper gives a complete theoretical treatment of the link between Gaussian processes and dropout, and develops the tools necessary to represent uncertainty in deep learning. They show that a neural network with arbitrary depth and non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation to the probabilistic deep Gaussian process. I have yet to see researchers use dropout between every layer, so the discrepancy between theory and practice suggests that more research is necessary.

Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning Yarin Gal, Zoubin Ghahramani, in ICML. June 2016. [Appendix with relationship to Gaussian Processes]
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks Yarin Gal, in arXiv:1512.05287. May 2016.

What My Deep Model Doesn't Know. Yarin Gal. Blog Post. July 2015

Homoscedastic and Heteroscedastic Regression with Dropout Uncertainty. Yarin Gal. Blog Post. February 2016.

Test-time dropout is used to provide uncertainty estimates for deep learning systems.

In conclusion, maybe we can never get both interpretability and performance when it comes to deep learning systems. But, we can all agree that providing confidences, or uncertainty estimates, alongside predictions is always a good idea. Dropout, the very single regularization trick used to battle overfitting in deep models, shows up, yet again. Sometimes all you need is to add some random variations to your input, and average the results over many trials. Dropout lets you not only wiggle the network inputs but the entire architecture.

I do wonder what Yann LeCun thinks about Bayesian ConvNets... Last I heard, he was allergic to sampling.

Related Posts
Deep Learning vs Probabilistic Graphical Models vs Logic April 2015
Deep Learning Trends @ ICLR 2016 June 2016