
Tuesday, November 19, 2019

Computer Vision and Visual SLAM vs. AI Agents

With all the recent advancements in end-to-end deep learning, it is now possible to train AI agents to perform many different tasks (some in simulation and some in the real world). End-to-end learning allows one to replace a multi-component, hand-engineered system with a single learned network that processes raw sensor data and outputs actions for the AI agent to take in the physical world. I will discuss the implications of these ideas while highlighting some new research trends regarding Deep Learning for Visual SLAM, and conclude with some predictions about the kinds of spatial reasoning algorithms we will need in the future.




In today's article, we will go over three ideas:

 I.) Does Computer Vision Matter for Action?
 II.) Visual SLAM for AI agents
 III.) Quō vādis Visual SLAM? Trends and research forecast

I. Does Computer Vision Matter for Action?

At last month's International Conference on Computer Vision (ICCV 2019), I heard the following thought-provoking question,
"What do Artificial Intelligence Agents need (if anything) from the field of Computer Vision?"
The question was posed by Vladlen Koltun (from Intel Research) [1] during his talk at the Deep Learning for Visual SLAM Workshop at ICCV 2019 in Seoul. He spoke about building AI agents with and without the aid of computer vision to guide representation learning. While Koltun has worked on classical Visual SLAM (see his Direct Sparse Odometry (DSO) system [2]), at this workshop he decided not to speak about his older work on geometry, alignment, or 3D point cloud processing. His talk spanned several of his team's research papers, included some humor (see video), and offered plenty of Koltun's philosophical views on general artificial intelligence.

Recent techniques show that it is possible to learn actions (the output quantities that we really want) from pixels (raw inputs) directly without any intermediate computer vision processing like object recognition, depth estimation, and segmentation. But just because it is possible to solve some AI tasks without intermediate representations (i.e., the computer vision stuff), does that mean that we should abandon computer vision research and let end-to-end learning take care of everything? Probably not.

From a very practical standpoint, let's ask the following question: 
"Is an agent who is aware of computer vision stuff more robust than an agent trained without intermediate representations?" 
Recent research from Koltun's lab [3] indicates that the answer is yes: training with intermediate representations, via supervision from per-frame computer vision tasks, yields agents that learn faster, are more robust, and perform better across a variety of tasks! The next natural question is: which computer vision tasks matter most for agent robustness? Koltun's research suggests that depth estimation works particularly well as an auxiliary task when training agents that have to move through space (i.e., most video games). This makes sense: a depth estimation network should help an AI agent navigate an unknown environment, and depth estimation is a key component in many of today's RGBD Visual SLAM systems. The best way to learn about Koltun's paper, titled Does Computer Vision Matter for Action?, is to watch the video on YouTube.

Video describing Koltun's Does Computer Vision Matter for Action? [3]
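To make the auxiliary-supervision idea concrete, here is a minimal sketch (my own illustration, not the architecture from [3]) of an agent whose policy head and auxiliary depth-prediction head share one visual encoder, so the depth loss shapes the representation the policy uses. The layer sizes, loss weighting, and placeholder data are all assumptions.

# Sketch: a policy network with an auxiliary depth head sharing one encoder.
import torch
import torch.nn as nn

class AgentWithAuxDepth(nn.Module):
    def __init__(self, num_actions: int = 6):
        super().__init__()
        self.encoder = nn.Sequential(               # shared visual backbone
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Sequential(           # actions from pooled features
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, num_actions)
        )
        self.depth_head = nn.Sequential(             # auxiliary dense depth prediction
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, rgb):
        feats = self.encoder(rgb)
        return self.policy_head(feats), self.depth_head(feats)

model = AgentWithAuxDepth()
rgb = torch.randn(2, 3, 128, 128)                   # fake batch of frames
action_logits, depth_pred = model(rgb)
target_actions = torch.randint(0, 6, (2,))          # placeholder action labels
target_depth = torch.rand(2, 1, 128, 128)           # placeholder depth maps
loss = nn.functional.cross_entropy(action_logits, target_actions) \
     + 0.5 * nn.functional.l1_loss(depth_pred, target_depth)
loss.backward()                                      # both heads shape the shared encoder

The key point is that the depth head can be thrown away at deployment time; it exists only to force the encoder to learn geometry-aware features.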

Let's imagine that you want to deploy a robot into the world sometime between now and 2025, based on your large-scale AI agent training, and you're debating whether or not to avoid intermediate representations.

Intermediate representations facilitate explainability, debuggability, and testing. Explainability is a key to success when systems require spatial reasoning capabilities in the real world. If your agents are misbehaving, take a look at their intermediate representations. If you want to improve your AI, you can analyze the computer vision systems to better prioritize your data-collection efforts. Visualization should be a first-order citizen in your deep learning toolbox.

But today's computer vision ecosystem offers more than algorithms that process individual images. Visual SLAM systems rapidly process images while updating both the camera's trajectory and a 3D map of the world. Visual SLAM, or VSLAM, algorithms are the real-time variants of Structure-from-Motion (SfM), which has been around for a while. SfM relies on bundle adjustment -- a minimization of reprojection error, usually solved with Levenberg-Marquardt. If you see any kind of robot moving around today (2019), it is likely running some variant of SLAM (localization and mapping) and not an end-to-end trained network -- at least not today. So what does Visual SLAM mean for AI agents?
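The bundle-adjustment recipe mentioned above fits in a few lines of NumPy/SciPy. Below is a toy sketch of the textbook formulation (not code from any particular SLAM system): pack camera poses and 3D points into one parameter vector and let a Levenberg-Marquardt solver minimize the stacked reprojection errors. The pinhole intrinsics, noise levels, and problem size are arbitrary assumptions.

# Toy bundle adjustment: minimize reprojection error over poses and points.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(points_3d, rvec, tvec, f=500.0, c=320.0):
    # Pinhole projection of world points into one camera (rvec is an axis-angle rotation).
    cam = Rotation.from_rotvec(rvec).apply(points_3d) + tvec
    return f * cam[:, :2] / cam[:, 2:3] + c

def residuals(params, n_cams, n_pts, observations):
    # Stack reprojection errors over all (camera, point, observed pixel) triples.
    poses = params[: n_cams * 6].reshape(n_cams, 6)
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    errs = []
    for cam_idx, pt_idx, uv in observations:
        pred = project(pts[pt_idx:pt_idx + 1], poses[cam_idx, :3], poses[cam_idx, 3:])
        errs.append(pred[0] - uv)
    return np.concatenate(errs)

# Synthetic problem: 2 cameras observing 20 points.
rng = np.random.default_rng(0)
pts_gt = rng.uniform(-1, 1, (20, 3)) + np.array([0.0, 0.0, 5.0])
poses_gt = np.array([[0, 0, 0, 0, 0, 0], [0, 0.1, 0, -0.5, 0, 0]], dtype=float)
obs = [(ci, i, project(pts_gt[i:i + 1], poses_gt[ci, :3], poses_gt[ci, 3:])[0])
       for ci in range(2) for i in range(20)]

# Start from a noisy initial guess and refine with Levenberg-Marquardt ("lm").
x0 = np.concatenate([(poses_gt + rng.normal(0, 0.01, poses_gt.shape)).ravel(),
                     (pts_gt + rng.normal(0, 0.05, pts_gt.shape)).ravel()])
result = least_squares(residuals, x0, args=(2, 20, obs), method="lm")
print("final reprojection RMSE:", np.sqrt(np.mean(result.fun ** 2)))

Real systems add robust loss functions, exploit the sparse structure of the Jacobian, and fix a gauge (e.g., the first camera), but the objective is the same.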

II. Visual SLAM for AI Agents

While no single per-frame computer vision algorithm is close to sufficient to enable robust action in an environment, there is a class of real-time computer vision systems like Visual SLAM that can be used to guide agents through space. The Workshop on Deep Learning for Visual SLAM at ICCV 2019 showcased a variety of different Visual SLAM approaches and included a discussion panel. The workshop featured talks on Visual SLAM on mobile platforms (Victor Prisacariu from 6d.ai), autonomous cars (Daniel Cremers from TUM and ArtiSense.ai), high-detail indoor modeling (Angela Dai from TUM), AI Agents (Vladlen Koltun from Intel Research) and mixed-reality (Tomasz Malisiewicz from Magic Leap). 


Teaser image for the 2nd Workshop on Deep Learning for Visual SLAM, from Ronnie Clark. See info at http://visualslam.ai

When it comes to spatial perception capabilities, Koltun's talk made it clear that we, as computer vision researchers, could think bolder. There is a spectrum of spatial perception capabilities that AI agents need that only somewhat overlaps with traditional Visual SLAM (whether deep learning-based or not).

Koltun's work favors using intermediate representations based on computer vision to produce more robust AI agents. However, Koltun is not convinced that 6dof Visual SLAM, as currently defined, needs to be solved for AI agents. Consider ordinary human tasks like walking, washing your hands, and flossing your teeth -- each one requires a different degree of spatial reasoning ability. It is reasonable to assume that AI agents would need varying degrees of spatial localization and mapping capability to perform such tasks.

Visual SLAM techniques, like the ones used inside Augmented Reality systems, build metric 3D maps of the environment for the task of high-precision placement of digital content -- but such high-precision systems might never be used directly inside AI agents. When the camera is hand-held (augmented reality) or head-mounted (mixed reality), a human decides where to move. AI agents have to make their own movement decisions, and this requires more than feature correspondences and bundle adjustment -- more than what is inside the scope of computer vision.

Inside a head-mounted display, you might look at digital content 30 feet away from you, and for everything to look correct geometrically, you must have a decent 3D map of the world (spanning at least 30 feet) and a reasonable estimate of your pose. But for many tasks that AI agents need to perform, metric-level representations of far-away geometry are unnecessary. It is as if proper action requires local, high-quality metric maps and something coarser, like topological maps, at longer ranges. Visual SLAM systems (stereo-based and depth-sensor-based) are likely to find numerous applications in industry, such as mixed reality and the branches of robotics where millimeter precision matters.
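To make the "local metric maps plus a coarse topological graph" intuition more tangible, here is a speculative sketch of such a hybrid map as a data structure -- my own illustration, not a published system. The place names, pose representation, and planner are all assumptions.

# Sketch: coarse topological graph of places, each holding a local metric submap.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Place:
    # A node in the topological map holding a small local metric submap.
    name: str
    local_points: np.ndarray                        # Nx3 points in the place's own frame
    neighbors: dict = field(default_factory=dict)   # neighbor name -> rough relative pose (4x4)

class HybridMap:
    def __init__(self):
        self.places = {}

    def add_place(self, place: Place):
        self.places[place.name] = place

    def connect(self, a: str, b: str, rel_pose: np.ndarray):
        # Only a coarse relative transform is stored between places;
        # millimeter-level geometry lives inside each local submap.
        self.places[a].neighbors[b] = rel_pose
        self.places[b].neighbors[a] = np.linalg.inv(rel_pose)

    def plan(self, start: str, goal: str):
        # Breadth-first search over the topological graph (no metric optimization).
        frontier, parent = [start], {start: None}
        while frontier:
            node = frontier.pop(0)
            if node == goal:
                path = [node]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            for nxt in self.places[node].neighbors:
                if nxt not in parent:
                    parent[nxt] = node
                    frontier.append(nxt)
        return None

m = HybridMap()
for name in ["kitchen", "hallway", "office"]:
    m.add_place(Place(name, np.random.rand(100, 3)))
m.connect("kitchen", "hallway", np.eye(4))
m.connect("hallway", "office", np.eye(4))
print(m.plan("kitchen", "office"))                  # ['kitchen', 'hallway', 'office']

The appeal of such a split is that long-range planning never touches dense geometry, while precise interaction only ever needs the submap the agent is currently standing in.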

More general end-to-end learning for AI agents will show us new kinds of spatial intelligence, automatically learned from data. There is a lot of exciting research to be done to answer questions like the following: What kinds of tasks can we train Visual AI Agents on such that map-building and localization capabilities arise? Or, what type of core spatial reasoning capabilities can we pre-build to enable further self-supervised learning from the 3D world?

III. Quō vādis Visual SLAM? Trends and research forecast

At the Deep Learning Workshop for Visual SLAM, an interesting question that came up in the panel focused on the convergence of methods in Visual SLAM. Or alternatively,
"Will a single Visual SLAM framework rule them all?"
The world of applied research is moving towards more deep learning -- by 2019, many of the critical tasks in computer vision exist as some form of (convolutional/graph) neural network. I don't believe we will see a single SLAM framework/paradigm dominate all others -- I think we will see a plurality of Visual SLAM systems based on interchangeable deep learning components. This new generation of deep learning-based components will allow more creative applications of end-to-end learning and will typically be useful as modules within other real-world systems. We should create tools that enable others to make better tools.

PyTorch is making it easy to build multiple-view geometry tools like Kornia [4] -- so that the right parts of computer vision are brought directly into today's deep learning ecosystem as first-order citizens. And PyTorch is winning over the world of research: its usage increased dramatically from 2017 to 2019, and it is now the recommended framework amongst most of my fellow researchers.
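Here is a tiny example of what "geometry as a first-order citizen" can look like in practice: using Kornia's differentiable image warp to refine a homography by gradient descent on photometric error. This is an illustrative sketch only, and it assumes kornia.geometry.transform.warp_perspective keeps its documented (image, 3x3 matrix, output size) interface; the image, learning rate, and iteration count are placeholders.

# Sketch: refine a homography by backpropagating through a differentiable warp.
import torch
import kornia

target = torch.rand(1, 1, 64, 64)                    # stand-in "reference" image
H_true = torch.tensor([[[1.0, 0.02, 3.0],
                        [-0.02, 1.0, -2.0],
                        [0.0, 0.0, 1.0]]])
source = kornia.geometry.transform.warp_perspective(target, H_true, dsize=(64, 64))

H = torch.eye(3).unsqueeze(0).clone().requires_grad_(True)   # start from identity
opt = torch.optim.Adam([H], lr=1e-3)
for step in range(200):
    opt.zero_grad()
    warped = kornia.geometry.transform.warp_perspective(source, H, dsize=(64, 64))
    loss = torch.nn.functional.l1_loss(warped, target)        # photometric error
    loss.backward()                                            # gradients flow through the warp
    opt.step()
print("final photometric loss:", loss.item())

Nothing here is learned from data -- the point is simply that the geometric operation sits inside the autograd graph, so it composes freely with any network you wrap around it.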

To see what the end goal of end-to-end deep learning for Visual SLAM might look like, take a look at gradSLAM [5] from Krishna Murthy, a Ph.D. student at MILA, and collaborators at CMU. Their paper offers a new way of thinking of SLAM as being made up of differentiable blocks. From the article, "This amalgamation of dense SLAM with computational graphs enables us to backprop from 3D maps to 2D pixels, opening up new possibilities in gradient-based learning for SLAM."


Key Figure from the gradSLAM paper on end-to-end learning for SLAM. [5]
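The core idea is easy to demonstrate with a few lines of PyTorch. The snippet below is not the gradSLAM code -- just a minimal illustration of the principle: when the map-building step (here, a pinhole unprojection) is differentiable, a loss defined on the 3D map sends gradients all the way back to the 2D depth pixels. The intrinsics and the "flat wall" target are made-up placeholders.

# Sketch: backprop from a loss on the 3D map to the 2D depth measurements.
import torch

def unproject(depth, fx=100.0, fy=100.0, cx=32.0, cy=32.0):
    # Differentiable pinhole unprojection: depth image -> 3D point cloud.
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                          torch.arange(w, dtype=torch.float32), indexing="ij")
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return torch.stack([x, y, depth], dim=-1).reshape(-1, 3)

depth = torch.full((64, 64), 2.0, requires_grad=True)    # initial depth estimate
points = unproject(depth)                                 # differentiable "mapping" block
target_plane_z = 1.5                                      # pretend the map should be a flat wall
map_loss = ((points[:, 2] - target_plane_z) ** 2).mean()  # loss defined on the 3D map...
map_loss.backward()                                       # ...reaches every 2D depth pixel
print(depth.grad.shape)                                   # torch.Size([64, 64])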

Another key trend on the rise in the context of deep Visual SLAM is self-supervised learning. We are seeing more and more practical successes of self-supervised learning for multi-view problems, where geometry lets us get away without strong supervision. Even the ConvNet-based point detector SuperPoint [7], which my team and I developed at Magic Leap, uses self-supervision to train more robust interest point detectors. In our case, it was impossible to get ground-truth interest points on images, and self-labeling was the only way out. One of my favorite researchers working on self-supervised techniques is Adrien Gaidon from TRI, who studies how such methods can be used to make smarter cars. Adrien gave some great talks at other ICCV 2019 workshops related to autonomous vehicles, and his work is closely related to Visual SLAM and useful for anybody working on similar problems.

Adrien Gaidon's talk from October 11th, 2019 on Self-Supervised Learning in the context of Autonomous Cars

Another excellent presentation on this topic comes from Alyosha Efros, who does a great job of convincing you why you should love self-supervision.

A presentation about self-supervision from Alyosha Efros on May 25th, 2018
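The pattern that SuperPoint and the talks above share is simple: apply a known transformation to your own data and use the known correspondence as free supervision. Below is a loose sketch of that pattern -- not SuperPoint's actual recipe, and with homographic adaptation replaced by a trivial horizontal flip for brevity; the toy descriptor network and the loss are assumptions.

# Sketch: self-supervised descriptor consistency under a known image transform.
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 64, 3, padding=1))        # toy dense descriptor network

img = torch.rand(1, 1, 64, 64)
# Known synthetic warp: a horizontal flip, so the correspondence is trivial to
# express; a real pipeline would sample random homographies instead.
warped = torch.flip(img, dims=[3])

desc_a = F.normalize(net(img), dim=1)
desc_b = F.normalize(net(warped), dim=1)
desc_b_aligned = torch.flip(desc_b, dims=[3])               # undo the known warp

loss = (1 - (desc_a * desc_b_aligned).sum(dim=1)).mean()    # descriptors should agree
loss.backward()
print("self-supervised consistency loss:", loss.item())

No human ever labels a correspondence here -- the transformation itself is the label generator, which is exactly why this style of training scales so well.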


Conclusion

As more and more spatial reasoning skills get baked into deep networks, we must face two opposing forces. On the one hand, specifying internal representations makes it difficult to scale to new tasks -- it is easier to let the deep nets do all the hard work for you. On the other hand, we want interpretability and some amount of safety when we deploy AI agents into the real world, so some intermediate tasks like object recognition are likely to remain part of today's spatial perception recipe. Lots of exciting work is happening with multi-agent systems at OpenAI [6], but full end-to-end learning will not give us real-world robots such as autonomous cars anytime soon.



Video from OpenAI showing Multi-Agent Hide and Seek. [6]   


More practical Visual SLAM research will focus on differentiable high-level blocks. As more deep learning enters Visual SLAM, it will spark a renaissance: sharing entire SLAM systems will become as easy as sharing CNNs is today. I cannot wait until the following is possible:


pip install DeepSLAM
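Of course, pip install DeepSLAM does not exist (yet). Purely as a daydream, here is what a modular, pip-installable deep SLAM package might feel like -- every name below is hypothetical and invented for illustration.

# Purely hypothetical: "DeepSLAM" is not a real package; these names sketch
# what interchangeable deep SLAM components could look like.
from dataclasses import dataclass

@dataclass
class DeepSLAMConfig:
    frontend: str = "superpoint-like"    # learned keypoints + descriptors
    depth: str = "self-supervised-mono"  # learned single-view depth prior
    backend: str = "bundle-adjustment"   # classical geometric optimization

def build_slam(config: DeepSLAMConfig):
    # Pretend factory: swap learned modules the way we swap CNN backbones today.
    print(f"SLAM pipeline = {config.frontend} + {config.depth} + {config.backend}")

build_slam(DeepSLAMConfig())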

I hope you enjoyed learning about the different approaches to Visual SLAM, and that you have found my blog post insightful and educational. Until next time!


References:
[1]. Vladlen Koltun. Chief Scientist for Intelligent Systems at Intel. http://vladlen.info/
[2]. Direct Sparse Odometry. Jakob Engel, Vladlen Koltun, and Daniel Cremers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3), 2018. http://vladlen.info/publications/direct-sparse-odometry/
[3]. Does Computer Vision Matter for Action? Brady Zhou, Philipp Krähenbühl, and Vladlen Koltun. Science Robotics, 4(30), 2019. http://vladlen.info/publications/computer-vision-matter-action/
[4]. Kornia: an Open Source Differentiable Computer Vision Library for PyTorch. Edgar Riba, Dmytro Mishkin, Daniel Ponsa, Ethan Rublee, and Gary Bradski. Winter Conference on Applications of Computer Vision, 2019. https://kornia.github.io/
[5]. gradSLAM: Dense SLAM meets Automatic Differentiation. Krishna Murthy J., Ganesh Iyer, and Liam Paull. In arXiv, 2019. http://montrealrobotics.ca/gradSLAM/
[6]. Emergent Tool Use From Multi-Agent Autocurricula. Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. In arXiv, 2019. https://openai.com/blog/emergent-tool-use/
[7]. SuperPoint: Self-Supervised Interest Point Detection and Description. Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018. https://arxiv.org/abs/1712.07629







Tuesday, December 08, 2015

ICCV 2015: Twenty one hottest research papers

"Geometry vs Recognition" becomes ConvNet-for-X

Computer Vision used to be cleanly separated into two schools: geometry and recognition. Geometric methods like structure from motion and optical flow usually focus on measuring objective quantities like 3D "real-world" distances directly from images, while recognition techniques like support vector machines and probabilistic graphical models traditionally focus on perceiving high-level semantic information (i.e., is this a dog or a table?) directly from images.

The world of computer vision has changed, and fast. We now have powerful convolutional neural networks that are able to extract just about anything directly from images. So if your input is an image (or set of images), then there's probably a ConvNet for your problem. While you do need a large labeled dataset, believe me when I say that collecting a large dataset is much easier than manually tweaking knobs inside your 100K-line codebase. As we're about to see, the separation between geometric methods and learning-based methods is no longer easily discernible.
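To make the "ConvNet-for-X" point concrete, here is a minimal sketch (my own illustration, written with today's PyTorch tooling rather than 2015's frameworks): one shared convolutional backbone, re-headed for whichever output your problem needs. The layer sizes and task heads are arbitrary.

# Sketch: one ConvNet backbone, many task-specific heads.
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

heads = {
    "classification": nn.Linear(64, 1000),   # is this a dog or a table?
    "viewpoint": nn.Linear(64, 3),           # a geometric quantity, same recipe
}

x = torch.randn(4, 3, 224, 224)
features = backbone(x)
for task, head in heads.items():
    print(task, head(features).shape)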

By 2016 just about everybody in the computer vision community will have tasted the power of ConvNets, so let's take a look at some of the hottest new research directions in computer vision.

ICCV 2015's Twenty One Hottest Research Papers



This December in Santiago, Chile, the International Conference on Computer Vision 2015 is going to bring together the world's leading researchers in Computer Vision, Machine Learning, and Computer Graphics.

To no one's surprise, this year's ICCV is filled with ConvNets, but this time these Deep Learning tools are being applied to much more creative tasks. Let's take a look at the following twenty-one ICCV 2015 research papers, which will hopefully give you a taste of where the field is going.


1. Ask Your Neurons: A Neural-Based Approach to Answering Questions About Images Mateusz Malinowski, Marcus Rohrbach, Mario Fritz


"We propose a novel approach based on recurrent neural networks for the challenging task of answering of questions about images. It combines a CNN with a LSTM into an end-to-end architecture that predict answers conditioning on a question and an image."




2. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler



"To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book."







3. Learning to See by Moving Pulkit Agrawal, Joao Carreira, Jitendra Malik


"We show that using the same number of training images, features learnt using egomotion as supervision compare favourably to features learnt using class-label as supervision on the tasks of scene recognition, object recognition, visual odometry and keypoint matching."







4. Local Convolutional Features With Unsupervised Training for Image Retrieval Mattis Paulin, Matthijs Douze, Zaid Harchaoui, Julien Mairal, Florent Perronin, Cordelia Schmid



"We introduce a deep convolutional architecture that yields patch-level descriptors, as an alternative to the popular SIFT descriptor for image retrieval."






5. Deep Networks for Image Super-Resolution With Sparse Prior Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, Thomas Huang



"We show that a sparse coding model particularly designed for super-resolution can be incarnated as a neural network, and trained in a cascaded structure from end to end."



6. High-for-Low and Low-for-High: Efficient Boundary Detection From Deep Object Features and its Applications to High-Level Vision Gedas Bertasius, Jianbo Shi, Lorenzo Torresani



"In this work we show how to predict boundaries by exploiting object level features from a pretrained object-classification network."















7. A Deep Visual Correspondence Embedding Model for Stereo Matching Costs Zhuoyuan Chen, Xun Sun, Liang Wang, Yinan Yu, Chang Huang



"A novel deep visual correspondence embedding model is trained via Convolutional Neural Network on a large set of stereo images with ground truth disparities. This deep embedding model leverages appearance data to learn visual similarity relationships between corresponding image patches, and explicitly maps intensity values into an embedding feature space to measure pixel dissimilarities."





8. Im2Calories: Towards an Automated Mobile Vision Food Diary Austin Meyers, Nick Johnston, Vivek Rathod, Anoop Korattikara, Alex Gorban, Nathan Silberman, Sergio Guadarrama, George Papandreou, Jonathan Huang, Kevin P. Murphy



"We present a system which can recognize the contents of your meal from a single image, and then predict its nutritional contents, such as calories."









9. Unsupervised Visual Representation Learning by Context Prediction Carl Doersch, Abhinav Gupta, Alexei A. Efros



"How can one write an objective function to encourage a representation to capture, for example, objects, if none of the objects are labeled?"
















10. Deep Neural Decision Forests Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, Samuel Rota Bulò



"We introduce a stochastic and differentiable decision tree model, which steers the representation learning usually conducted in the initial layers of a (deep) convolutional network."






11. Conditional Random Fields as Recurrent Neural Networks Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, Philip H. S. Torr



"We formulate mean-field approximate inference for the Conditional Random Fields with Gaussian pairwise potentials as Recurrent Neural Networks."






12. Flowing ConvNets for Human Pose Estimation in Videos Tomas Pfister, James Charles, Andrew Zisserman



"We investigate a ConvNet architecture that is able to benefit from temporal context by combining information across the multiple frames using optical flow."





13. Dense Optical Flow Prediction From a Static Image Jacob Walker, Abhinav Gupta, Martial Hebert



"Given a static image, P-CNN predicts the future motion of each and every pixel in the image in terms of optical flow. Our P-CNN model leverages the data in tens of thousands of realistic videos to train our model. Our method relies on absolutely no human labeling and is able to predict motion based on the context of the scene."


14. DeepBox: Learning Objectness With Convolutional Networks Weicheng Kuo, Bharath Hariharan, Jitendra Malik



"Our framework, which we call DeepBox, uses convolutional neural networks (CNNs) to rerank proposals from a bottom-up method."








15. Active Object Localization With Deep Reinforcement Learning Juan C. Caicedo, Svetlana Lazebnik



"This agent learns to deform a bounding box using simple transformation actions, with the goal of determining the most specific location of target objects following top-down reasoning."





16. Predicting Depth, Surface Normals and Semantic Labels With a Common Multi-Scale Convolutional Architecture David Eigen, Rob Fergus



"We address three different computer vision tasks using a single multiscale convolutional network architecture: depth prediction, surface normal estimation, and semantic labeling."















17. HD-CNN: Hierarchical Deep Convolutional Neural Networks for Large Scale Visual Recognition Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, Yizhou Yu



"We introduce hierarchical deep CNNs (HD-CNNs) by embedding deep CNNs into a category hierarchy. An HD-CNN separates easy classes using a coarse category classifier while distinguishing difficult classes using fine category classifiers."





18. FlowNet: Learning Optical Flow With Convolutional Networks Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox



"We construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task."







19. Understanding Deep Features With Computer-Generated Imagery Mathieu Aubry, Bryan C. Russell


"Rendered images are presented to a trained CNN and responses for different layers are studied with respect to the input scene factors."







20. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization Alex Kendall, Matthew Grimes, Roberto Cipolla



"Our system trains a convolutional neural network to regress the 6-DOF camera pose from a single RGB image in an end-to-end manner with no need of additional engineering or graph optimisation."





21. Visual Tracking With Fully Convolutional Networks Lijun Wang, Wanli Ouyang, Xiaogang Wang, Huchuan Lu




"A new approach for general object tracking with fully convolutional neural network."



Conclusion

While some could argue that the great convergence upon ConvNets is making the field less diverse, it is actually making the techniques easier to comprehend. It is easier to "borrow breakthrough thinking" from one research direction when the core computations are cast in the language of ConvNets. Using ConvNets, properly trained (and motivated!) 21-year-old graduate students are actually able to compete on benchmarks, where previously it would take an entire 6-year PhD cycle to compete on a non-trivial benchmark.

See you next week in Chile!


Update (January 13th, 2016)

The following awards were given at ICCV 2015.

Achievement awards

  • PAMI Distinguished Researcher Award (1): Yann LeCun
  • PAMI Distinguished Researcher Award (2): David Lowe
  • PAMI Everingham Prize Winner (1): Andrea Vedaldi for VLFeat
  • PAMI Everingham Prize Winner (2): Daniel Scharstein and Rick Szeliski for the Middlebury Datasets

Paper awards

  • PAMI Helmholtz Prize (1): David Martin, Charles Fowlkes, Doron Tal, and Jitendra Malik for their ICCV 2001 paper "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics".
  • PAMI Helmholtz Prize (2): Serge Belongie, Jitendra Malik, and Jan Puzicha, for their ICCV 2001 paper "Matching Shapes".
  • Marr Prize: Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulò, for "Deep Neural Decision Forests".
  • Marr Prize honorable mention: Saining Xie and Zhuowen Tu for "Holistically-Nested Edge Detection".
For more information about awards, see Sebastian Nowozin's ICCV-day-2 blog post.

I also wrote another ICCV-related blog post (January 13, 2016) about the Future of Real-Time SLAM.

Wednesday, June 26, 2013

[Awesome@CVPR2013] Scene-SIRFs, Sketch Tokens, Detecting 100,000 object classes, and more

I promised to blog about some more exciting papers at CVPR 2013, so here is a short list of a few papers which stood out.  This list also includes this year's award-winning paper: Fast, Accurate Detection of 100,000 Object Classes on a Single Machine.  Congrats to Google Research on the excellent paper!



This paper uses ideas from Abhinav Gupta's work on 3D scene understanding as well as Ali Farhadi's work on visual phrases; however, it also uses RGB-D input data (like many other CVPR 2013 papers).

W. Choi, Y. -W. Chao, C. Pantofaru, S. Savarese. "Understanding Indoor Scenes Using 3D Geometric Phrases" in CVPR, 2013. [pdf]

This paper uses the crowd to learn which parts of birds are useful for fine-grained categorization.  If you work on fine-grained categorization or run experiments with MTurk, then you gotta check this out!
Fine-Grained Crowdsourcing for Fine-Grained Recognition. Jia Deng, Jonathan Krause, Li Fei-Fei. CVPR, 2013. [ pdf ]

This paper won the best paper award.  Congrats Google Research!

Fast, Accurate Detection of 100,000 Object Classes on a Single Machine. Thomas Dean, Mark Ruzon, Mark Segal, Jon Shlens, Sudheendra Vijayanarasimhan, Jay Yagnik. CVPR, 2013 [pdf]


The following is the Scene-SIRFs paper, which I thought was one of the best papers at this year's CVPR.  The idea is to decompose an input image into intrinsic images using Barron's algorithm, which was initially shown to work on objects but is now being applied to realistic scenes.

Intrinsic Scene Properties from a Single RGB-D Image. Jonathan T. Barron, Jitendra Malik. CVPR, 2013 [pdf]


This is a graph-based localization paper which uses a sort of "Visual Memex" to solve the problem.
Graph-Based Discriminative Learning for Location Recognition. Song Cao, Noah Snavely. CVPR, 2013. [pdf]


This paper provides an exciting new way of localizing contours in images which is orders of magnitude faster than gPb.  There is code available, so the impact is likely to be high.

Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection. Joseph J. Lim, C. Lawrence Zitnick, and Piotr Dollar. CVPR 2013. [ pdf ] [code@github]

Tuesday, June 18, 2013

Must-see Workshops @ CVPR 2013

June is that wonderful month during which computer vision researchers, students, and entrepreneurs go to CVPR -- the premier yearly Computer Vision conference.  Whether you are presenting a paper, learning about computer vision, networking with academic colleagues, looking for rock-star vision experts to join your start-up, or looking for rock-star vision start-ups to join, CVPR is where all of the action happens!  If you're not planning on going, it is not too late! The Conference starts next week in Portland, Oregon.


There are lots of cool papers at CVPR, many of which I have already studied in great detail, and many others which I will learn about next week.  I will write about some of the cool papers/ideas I encounter while I'm at CVPR next week.  In addition to the main conference, CVPR has 3 action-packed workshop days.  I want to take this time to mention two super-cool workshops which are worth checking out during CVPR 2013.  Workshop talks are generally better than the main conference talks, since the invited speakers tend to be more senior and they get to present a broader view of their research (compared to the content of a single 8-page research paper, as is typically discussed during the main conference).

SUNw: Scene Understanding Workshop
Sunday June 23, 2013


From the webpage: Scene understanding started with the goal of building machines that can see like humans to infer general principles and current situations from imagery, but it has become much broader than that. Applications such as image search engines, autonomous driving, computational photography, vision for graphics, human machine interaction, were unanticipated and other applications keep arising as scene understanding technology develops. As a core problem of high level computer vision, while it has enjoyed some great success in the past 50 years, a lot more is required to reach a complete understanding of visual scenes.

I attended some of the other SUN workshops, which were held at MIT during the winter months.  This time around, the workshop is at CVPR, so by definition it will be accessible to more researchers.  Even though I have the pleasure of knowing personally the super-smart workshop organizers (Jianxiong Xiao, Aditya Khosla, James Hays, and Derek Hoiem), the most exciting tidbit about this workshop is the all-star invited speaker schedule.  The speakers include: Ali Farhadi, Yann LeCun, Fei-Fei Li, Aude Oliva, Deva Ramanan, Silvio Savarese, Song-Chun Zhu, and Larry Zitnick.  To hear some great talks and learn about truly bleeding-edge research by some of vision's most talented researchers, come to SUNw.

VIEW 2013: Vision Industry and Entrepreneur Workshop
Monday, June 24, 2013



From the webpage: Once largely an academic discipline, computer vision today is also a commercial force. Startups and global corporations are building businesses based on computer vision technology. These businesses provide computer vision based solutions for the needs of consumers, enterprises in many commercial sectors, non-profits, and governments. The demand for computer vision based solutions is also driving commercial and open-source development in associated areas, including hardware and software platforms for embedded and cloud deployments, new camera designs, new sensing technologies, and compelling applications. Last year, we introduced the IEEE Vision Industry and Entrepreneur Workshop (VIEW) at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) to bridge the academic and commercial worlds of computer vision. 

I include this workshop in the must-see list because the time is right for Computer Vision researchers to start innovating at start-ups.  First of all, the world wants your vision-based creations today.  With the availability of smart phones and widespread broadband access, the world does not want to wait a decade until the full academic research pipeline gets translated into products.  Seeing such workshops at CVPR is exciting, because this will help breed a new generation of researcher-entrepreneurs.  I, for one, welcome our new company-starting computer vision overlords.