Tuesday, November 19, 2019

Computer Vision and Visual SLAM vs. AI Agents

With all the recent advancements in end-to-end deep learning, it is now possible to train AI agents to perform many different tasks (some in simulation and some in the real world). End-to-end learning allows one to replace a multi-component, hand-engineered system with a single learning network that can process raw sensor data and output actions for the AI to take in the physical world. I will discuss the implications of these ideas while highlighting some new research trends regarding Deep Learning for Visual SLAM, and conclude with some predictions regarding the kinds of spatial reasoning algorithms that we will need in the future.

In today's article, we will go over three ideas:

 I.) Does Computer Vision Matter for Action?
 II.) Visual SLAM for AI agents
 III.) Quō vādis Visual SLAM? Trends and research forecast

I. Does Computer Vision Matter for Action?

At last month's International Conference on Computer Vision (ICCV 2019), I heard the following thought-provoking question,
"What do Artificial Intelligence Agents need (if anything) from the field of Computer Vision?"
The question was posed by Vladlen Koltun (from Intel Research) [1] during his talk at the Deep Learning for Visual SLAM Workshop at ICCV 2019 in Seoul. He spoke about building AI agents with and without the aid of computer vision to guide representation learning. While Koltun has worked on classical Visual SLAM (see his Direct Sparse Odometry (DSO) system [2]), at this workshop he chose not to speak about his older work on geometry, alignment, or 3D point cloud processing. His talk included numerous ideas spanning several of his team's research papers, some humor (see video), and plenty of Koltun's philosophical views on general artificial intelligence.

Recent techniques show that it is possible to learn actions (the output quantities that we really want) from pixels (raw inputs) directly without any intermediate computer vision processing like object recognition, depth estimation, and segmentation. But just because it is possible to solve some AI tasks without intermediate representations (i.e., the computer vision stuff), does that mean that we should abandon computer vision research and let end-to-end learning take care of everything? Probably not.

From a very practical standpoint, let's ask the following question: 
"Is an agent who is aware of computer vision stuff more robust than an agent trained without intermediate representations?" 
Recent research from Koltun's lab [3] indicates that the answer is yes: training with intermediate representations, as done by supervision from per-frame computer vision tasks, gives rise to agents that learn faster and perform more robustly across a variety of tasks! The next natural question is: which computer vision tasks matter most for agent robustness? Koltun's research suggests that depth estimation is one particular task that works well as an auxiliary task when training agents that have to move through space (as in most video games). A depth estimation network should help an AI agent navigate an unknown environment, as depth is a key ingredient in many of today's RGBD Visual SLAM systems. The best way to learn about Koltun's paper, titled Does Computer Vision Matter for Action?, is to see the video on YouTube.

Video describing Koltun's Does Computer Vision Matter for Action? [3]

Let's imagine that you want to deploy a robot into the world sometime between now and 2025 based on your large-scale AI agent training, and you're debating whether you should avoid intermediate representations or not.

Intermediate representations facilitate explainability, debuggability, and testing. Explainability is key to success when systems require spatial reasoning capabilities in the real world. If your agents are misbehaving, take a look at their intermediate representations. If you want to improve your AI, you can analyze the computer vision systems to better prioritize your data collection effort. Visualization should be a first-class citizen in your deep learning toolbox.

But today's computer vision ecosystem offers more than algorithms that process individual images. Visual SLAM systems rapidly process images while updating the camera's trajectory and the 3D map of the world. Visual SLAM, or VSLAM, algorithms are the real-time variants of Structure-from-Motion (SfM), which has been around for a while. SfM uses bundle adjustment -- a minimization of reprojection error, usually solved with Levenberg-Marquardt. If you see any kind of robot moving around today (2019), it is likely running some variant of SLAM (simultaneous localization and mapping) and not an end-to-end trained network. So what does Visual SLAM mean for AI agents?
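
As a brief aside on the bundle adjustment step mentioned above, here is a minimal sketch in Python with NumPy. It refines a single 3D point against two fixed cameras by minimizing reprojection error with a hand-rolled Levenberg-Marquardt loop. The function names, the camera intrinsics, and the toy scene are illustrative assumptions; a real SfM or SLAM system would also optimize the camera poses, use analytic Jacobians, and exploit sparsity.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of a 3D world point X into pixel coordinates."""
    x_cam = R @ X + t
    uv = K @ x_cam
    return uv[:2] / uv[2]

def residuals(X, cameras, observations):
    """Stacked reprojection residuals (predicted minus observed pixels)."""
    return np.concatenate([project(K, R, t, X) - obs
                           for (K, R, t), obs in zip(cameras, observations)])

def numeric_jacobian(X, cameras, observations, eps=1e-6):
    """Forward-difference Jacobian of the residuals w.r.t. the 3D point."""
    r0 = residuals(X, cameras, observations)
    J = np.zeros((r0.size, X.size))
    for i in range(X.size):
        step = np.zeros_like(X)
        step[i] = eps
        J[:, i] = (residuals(X + step, cameras, observations) - r0) / eps
    return J

def refine_point(X0, cameras, observations, iters=50, lam=1e-3):
    """Levenberg-Marquardt refinement of one 3D point, cameras held fixed."""
    X = np.asarray(X0, dtype=float)
    cost = np.sum(residuals(X, cameras, observations) ** 2)
    for _ in range(iters):
        r = residuals(X, cameras, observations)
        J = numeric_jacobian(X, cameras, observations)
        H = J.T @ J
        # Damped normal equations: small lam acts like Gauss-Newton,
        # large lam like a short gradient-descent step.
        delta = np.linalg.solve(H + lam * np.diag(np.diag(H)), -J.T @ r)
        new_cost = np.sum(residuals(X + delta, cameras, observations) ** 2)
        if new_cost < cost:
            X, cost, lam = X + delta, new_cost, lam * 0.1   # accept the step
        else:
            lam *= 10.0                                     # reject and damp more
    return X, cost

# Toy scene (made-up numbers): two identical cameras one unit apart along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
cameras = [(K, np.eye(3), np.zeros(3)),
           (K, np.eye(3), np.array([-1.0, 0.0, 0.0]))]
X_true = np.array([0.5, 0.2, 4.0])
observations = [project(K_i, R_i, t_i, X_true) for K_i, R_i, t_i in cameras]

# Start from a deliberately wrong guess and refine it.
X_est, final_cost = refine_point(np.array([0.0, 0.0, 3.0]), cameras, observations)
print(X_est, final_cost)
```

The accept/reject rule on the damping factor is what separates Levenberg-Marquardt from plain Gauss-Newton: a step is only kept if it actually lowers the reprojection error.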

II. Visual SLAM for AI Agents

While no single per-frame computer vision algorithm is close to sufficient to enable robust action in an environment, there is a class of real-time computer vision systems like Visual SLAM that can be used to guide agents through space. The Workshop on Deep Learning for Visual SLAM at ICCV 2019 showcased a variety of different Visual SLAM approaches and included a discussion panel. The workshop featured talks on Visual SLAM on mobile platforms (Victor Prisacariu from 6d.ai), autonomous cars (Daniel Cremers from TUM and ArtiSense.ai), high-detail indoor modeling (Angela Dai from TUM), AI agents (Vladlen Koltun from Intel Research), and mixed reality (Tomasz Malisiewicz from Magic Leap).

Teaser Image for the 2nd Workshop on Deep Learning for Visual SLAM from Ronnie Clark.
See info at http://visualslam.ai

When it comes to spatial perception capabilities, Koltun's talk made it clear that we, as computer vision researchers, could think bolder. There is a spectrum of spatial perception capabilities that AI agents need that only somewhat overlaps with traditional Visual SLAM (whether deep learning-based or not).

Koltun's work is in favor of using intermediate representations based on computer vision to produce more robust AI agents. However, Koltun is not convinced that 6-DoF Visual SLAM, as it is currently defined, needs to be solved for AI agents. Let's consider ordinary human tasks like walking, washing your hands, and flossing your teeth -- each one requires a different degree of spatial reasoning ability. It is reasonable to assume that AI agents would need varying degrees of spatial localization and mapping capabilities to perform such tasks.

Visual SLAM techniques, like the ones used inside Augmented Reality systems, build metric 3D maps of the environment for the task of high-precision placement of digital content -- but such high-precision systems might never be used directly inside AI agents. When the camera is hand-held (augmented reality) or head-mounted (mixed reality), a human decides where to move. AI agents have to make their own movement decisions, and this requires more than feature correspondences and bundle adjustment -- more than what is inside the scope of computer vision.

Inside a head-mounted display, you might look at digital content 30 feet away from you, and for everything to look correct geometrically, you must have a decent 3D map of the world (spanning at least 30 feet) and a reasonable estimate of your pose. But for many tasks that AI agents need to perform, metric-level representations of far-away geometry are unnecessary. It is as if proper action requires high-quality metric maps locally and something coarser, like topological maps, at larger ranges. Visual SLAM systems (stereo-based and depth-sensor based) are likely to find numerous applications in industry such as mixed reality and some branches of robotics, where millimeter precision matters.
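
One way to picture the "metric up close, topological far away" idea is the toy sketch below, in plain Python. It keeps a coarse graph of places for long-range routing and consults a small metric map only for the place where the agent has to act. The place names, edge costs, landmark coordinates, and function names are all hypothetical, and this is only one of many ways such a hybrid map could be organized.

```python
import heapq

# Coarse topological map: places are nodes, edges carry a rough traversal cost.
edges = {
    "kitchen":  [("hallway", 1.0)],
    "hallway":  [("kitchen", 1.0), ("bathroom", 1.0), ("bedroom", 2.0)],
    "bathroom": [("hallway", 1.0)],
    "bedroom":  [("hallway", 2.0)],
}

# Fine metric maps: landmark positions in metres, each in its place's own frame.
local_maps = {
    "bathroom": {"sink": (0.4, 1.2), "towel": (1.1, 0.3)},
    "kitchen":  {"tap": (0.2, 0.9)},
}

def plan_route(edges, start, goal):
    """Dijkstra over the place graph; returns (list of places, total cost)."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        if node in visited:
            continue
        visited.add(node)
        for neighbor, step in edges.get(node, []):
            if neighbor not in visited:
                heapq.heappush(frontier, (cost + step, neighbor, path + [neighbor]))
    return None, float("inf")

def local_target(local_maps, place, landmark):
    """Look up a metric target, but only inside the place we act in."""
    return local_maps[place][landmark]

# Long-range reasoning uses only the graph; precise geometry is needed at the end.
route, cost = plan_route(edges, "kitchen", "bathroom")
target = local_target(local_maps, route[-1], "sink")
print(route, cost, target)
```

Note that nothing in the route requires a globally consistent metric frame: the agent only needs precise coordinates once it arrives at the sink.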

More general end-to-end learning for AI agents will show us new kinds of spatial intelligence, automatically learned from data. There is a lot of exciting research to be done to answer questions like the following: What kind of tasks can we train Visual AI Agents for such that map-building and localization capabilities arise? Or, what type of core spatial reasoning capabilities can we pre-build to enable further self-supervised learning from the 3D world?

III. Quō vādis Visual SLAM? Trends and research forecast

At the Workshop on Deep Learning for Visual SLAM, an interesting question that came up in the panel focused on the convergence of methods in Visual SLAM. Or alternatively,
"Will a single Visual SLAM framework rule them all?"
The world of applied research is moving towards more deep learning -- by 2019, many of the critical tasks inside computer vision exist as some form of a (convolutional/graph) neural network. I don't believe that we will see a single SLAM framework/paradigm dominate all others -- I think we will see a plurality of Visual SLAM systems based on interchangeable deep learning components. This new generation of deep learning-based components will allow more creative applications of end-to-end learning and will typically be useful as modules within other real-world systems. We should create tools that will enable others to make better tools.

PyTorch is making it easy to build multiple-view geometry tools like Kornia [4] -- such that the right parts of computer vision are brought directly into today's deep learning ecosystem as first-class citizens. And PyTorch is winning over the world of research. A dramatic increase in usage happened from 2017 to 2019, with PyTorch now the recommended framework amongst most of my fellow researchers.

To see what the end goal of end-to-end deep learning for Visual SLAM might look like, take a look at gradSLAM [5] from Krishna Murthy, a Ph.D. student at MILA, and collaborators at CMU. Their paper offers a new way of thinking of SLAM as made up of differentiable blocks. From the article, "This amalgamation of dense SLAM with computational graphs enables us to backprop from 3D maps to 2D pixels, opening up new possibilities in gradient-based learning for SLAM."

Key Figure from the gradSLAM paper on end-to-end learning for SLAM. [5]
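
To give a feel for what "backprop from 3D maps to 2D pixels" means, here is a tiny sketch of a single differentiable block: lifting a pixel with a depth value to a 3D point, then pushing the gradient of a 3D loss back to that depth value. This is not gradSLAM's API or method. The chain rule is written out by hand in NumPy purely for illustration, whereas a real system would rely on an automatic differentiation framework, and all names and numbers below are assumptions.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a depth value to a 3D point in the camera frame."""
    return np.array([(u - cx) / fx * depth, (v - cy) / fy * depth, depth])

def map_loss(point, target):
    """Squared distance between a reconstructed 3D point and a target 3D point."""
    return float(np.sum((point - target) ** 2))

def grad_loss_wrt_depth(u, v, depth, target, fx, fy, cx, cy):
    """Chain rule by hand: dL/d(depth) = dL/dP . dP/d(depth)."""
    point = backproject(u, v, depth, fx, fy, cx, cy)
    dL_dP = 2.0 * (point - target)                          # gradient in 3D space
    dP_dd = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])   # back to the pixel's depth
    return float(dL_dP @ dP_dd)

# Made-up pixel, intrinsics, and target point.
u, v, depth = 400.0, 300.0, 2.0
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
target = np.array([0.3, 0.25, 2.1])

analytic = grad_loss_wrt_depth(u, v, depth, target, fx, fy, cx, cy)

# Sanity check against a central finite difference.
eps = 1e-5
numeric = (map_loss(backproject(u, v, depth + eps, fx, fy, cx, cy), target)
           - map_loss(backproject(u, v, depth - eps, fx, fy, cx, cy), target)) / (2 * eps)
print(analytic, numeric)
```

Once every block in the pipeline exposes gradients like this, an error measured on the 3D map can be used to update whatever produced the 2D measurements, which is the property the quote above is pointing at.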

Another key trend that seems to be on the rise inside the context of Deep Visual SLAM is self-supervised learning. We are seeing more and more practical successes of self-supervised learning for multi-view problems where geometry enables us to get away from strong supervision. Even the ConvNet-based point detector SuperPoint [7], which my team and I developed at Magic Leap, uses self-supervision to train more robust interest point detectors. In our case, it was impossible to get ground truth interest points on images, and self-labeling was the only way out. One of my favorite researchers working on self-supervised techniques is Adrien Gaidon from TRI, who studies how such methods can be used to make smarter cars. Adrien gave some great talks at other ICCV 2019 Workshops related to autonomous vehicles, and his work is closely related to Visual SLAM and useful for anybody working on similar problems.

Adrien Gaidon's talk from October 11th, 2019 on Self-Supervised Learning in the context of Autonomous Cars

Another excellent presentation on this topic comes from Alyosha Efros. He does a great job of convincing you why you should love self-supervision.

A presentation about self-supervision from Alyosha Efros on May 25th, 2018
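
To make the self-labeling idea a bit more tangible, here is a minimal sketch of how geometry can hand out labels for free: if we warp an image with a homography that we chose ourselves, then the location of every point in the warped image is known exactly, with no human annotation. This is only the underlying geometric trick, not the SuperPoint training pipeline; the point coordinates and the warp below are made up.

```python
import numpy as np

def warp_points(H, points):
    """Apply a 3x3 homography H to an (N, 2) array of pixel coordinates."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    warped = homogeneous @ H.T
    return warped[:, :2] / warped[:, 2:3]

# Points found in the original image by any detector (values are made up).
points = np.array([[10.0, 20.0], [200.0, 50.0], [120.0, 180.0]])

# A warp that we picked ourselves: rotation, scale, shift, and mild perspective.
angle = np.deg2rad(10.0)
H = np.array([[1.1 * np.cos(angle), -1.1 * np.sin(angle), 15.0],
              [1.1 * np.sin(angle),  1.1 * np.cos(angle), -8.0],
              [1e-4,                 2e-4,                 1.0]])

# Because H is known, these are exact correspondence labels for the warped image.
labels = warp_points(H, points)

# Mapping the labels back with the inverse warp recovers the original points.
recovered = warp_points(np.linalg.inv(H), labels)
print(labels)
print(recovered)
```

A detector trained to fire at `labels` in the warped image whenever it fires at `points` in the original is being supervised by geometry alone.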


As more and more spatial reasoning skills get baked into deep networks, we must face two opposing forces. On the one hand, specifying internal representations makes it difficult to scale to new tasks -- it is easier to trick the deep nets into doing all the hard work for you. On the other hand, we want interpretability and some amount of safety when we deploy AI agents into the real world, so some intermediate tasks like object recognition are likely to be involved in today's spatial perception recipe. Lots of exciting work is happening with multi-agents from OpenAI [6], but full end-to-end learning will not give us real-world robots such as autonomous cars anytime soon.

Video from OpenAI showing Multi-Agent Hide and Seek. [6]   

More practical Visual SLAM research will focus on differentiable high-level blocks. As more deep learning makes its way into Visual SLAM, the field will see a renaissance, because sharing entire SLAM systems will become as easy as sharing CNNs is today. I cannot wait until the following is possible:

pip install DeepSLAM

I hope you enjoyed learning about the different approaches to Visual SLAM, and that you have found my blog post insightful and educational. Until next time!

[1]. Vladlen Koltun. Chief Scientist for Intelligent Systems at Intel. http://vladlen.info/
[2]. Direct Sparse Odometry. Jakob Engel, Vladlen Koltun, and Daniel Cremers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3), 2018. http://vladlen.info/publications/direct-sparse-odometry/
[3]. Does Computer Vision Matter for Action? Brady Zhou, Philipp Krähenbühl, and Vladlen Koltun. Science Robotics, 4(30), 2019. http://vladlen.info/publications/computer-vision-matter-action/
[4]. Kornia: an Open Source Differentiable Computer Vision Library for PyTorch. Edgar Riba, Dmytro Mishkin, Daniel Ponsa, Ethan Rublee, and Gary Bradski. Winter Conference on Applications of Computer Vision, 2019. https://kornia.github.io/
[5]. gradSLAM: Dense SLAM meets Automatic Differentiation. Krishna Murthy J., Ganesh Iyer, and Liam Paull. In arXiv, 2019. http://montrealrobotics.ca/gradSLAM/
[6]. Emergent tool use from multi-agent autocurricula. Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. In arXiv, 2019. https://openai.com/blog/emergent-tool-use/
[7]. SuperPoint: Self-supervised interest point detection and description. Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018. https://arxiv.org/abs/1712.07629