Tombone's Computer Vision Blog: Microsoft

Showing posts with label Microsoft. Show all posts

Saturday, November 07, 2015

The Deep Learning Gold Rush of 2015

In the last few decades, we have witnessed major technological innovations such as personal computers and the internet finally reach the mainstream. And with mobile devices and social networks on the rise, we're now more connected than ever. So what's next? When is it coming? And how will it change our lives? Today I'll tell you that the next big advance is well underway and it's being fueled by a recent technique in the field of Artificial Intelligence known as Deep Learning.

The California Gold Rush of 2015 is all about Deep Learning.

It's everywhere, you just don't know how to look.

All of today's excitement in Artificial Intelligence and Machine Learning stems from ground-breaking results in speech and visual object recognition using Deep Learning[1]. These algorithms are being applied to all sorts of data, and the learned deep neural networks outperform traditional expert systems carefully designed by scientists and engineers. End-to-end learning of deep representations from raw data is now possible due to a handful of well-performing deep learning recipes (ConvNets, Dropout, ReLUs, LSTM, DQN, ImageNet). But if there's one final takeaway that we can extract from decades of machine learning research, is that for many problems going deep isn't a choice, it's often a requirement.

Most of the apps and services you're already using (AirBnB, Snapchat, Twitch.tv, Uber, Yelp, LinkedIn, etc) are quite data-hungry and before you know it, they're all going to go mega-deep. So whether you need to revitalize your data science team with deep learning or you're starting an AI-from-day-one operation, it's pretty clear that everybody is rushing to get some of this Silicon Valley Gold.

From Titans to Gold Miners: Your atypical Gold Rush

Like all great gold rushes, this movement is led by new faces, which are pouring into Silicon Valley like droves. But these aren't your typical unskilled immigrants willing to pick up a hammer, nor your fresh computer science grads with some app-writing skills. The key deep learning players of today (known as the Titans of Deep Learning) are computer science professors and researchers (seldom born in the USA) leaving their academic posts and bringing their students and ideas straight into Silicon Valley.

"Turn on, Tune in, Dropout" -- Timothy Leary

Recently, Google and Facebook announced that their operations are now being powered by Deep Learning [2,3]. And with most Deep Learning Titans representing the tech giants (Yann LeCun at Facebook Research, Geoffrey Hinton at Google, Andrew Ng at Baidu), Deep Learning is likely to become one of the most sought after tech skills. With Toyota to invest in $1 Billion in Robotics and Artificial Intelligence Research (November 6, 2015), the announcement of YC Research (October 7, 2015), and the new Google Brain Residency Program "Pre-doc" AI jobs (October 26, 2015), Silicon Valley just got a whole lot more interesting.

Silicon Valley re-defines itself, yet again

To understand why it took so long for Deep Learning to take-off, let's take a brief look at the key technologies which defined Silicon Valley over the last 50 years. The following timeline gives an overview of where Silicon Valley has been and where it's going.

Figure Adapted from Steve Blank's Secret History of Silicon Valley

1970s: Semiconductors
The story of the digital-era starts with semiconductors. "Silicon" in "Silicon Valley" originally referred to the silicon chip or integrated circuit innovations as well as the location (close to Stanford) of much tech-related activity. The dominant firm from that time period was Fairchild Semiconductor International and it eventually gave rise to more recognizable companies like Intel. For a more detailed discussion of this birthing era, take a look at Steve Blank's Secret History of Silicon Valley[4].

Read more about Fairchild at TechCrunch's First Trillion-Dollar Startup

1980s: Personal Computers
Initially computers were quite large and used solely by research labs, government, and big businesses. But it was the personal computer which turned computer programming from a hobby into a vital skill. You no longer needed to be an MIT student to program on one of these badboys. While both Microsoft and Apple were founded in 1975 and 1976, respectively, they persevered due to their pioneering work in graphical user interfaces. This was the birth of the modern user-friendly Operating System. IBM approached Microsoft in 1980, regarding its upcoming personal computer, and from then on Microsoft would be King for a very long time.

See Mac-history's article on Microsoft's relationship with Apple

1990s: Internet
While the nerds at Universities were posting ascii messages on newsgroups in the 90s, service providers in the 1990s like AOL helped make the internet accessible to everyone. Remember getting all those AOL disks in the mail? Buying a chunk of digital real state (your own domain name) became possible and anybody with a dial up connection and some primitive text/HTML skills could start posting online content. With a mission statement like "organize the world's information", it was eventually Google that got the most of out the late 90s dot-com bubble, and remains a very strong player in all things tech.

2000s: Mobile and Social
While the dot-com bubble was about creating an online presence for startups and established companies, the way we use the internet has dramatically changed since 2001. A ton of new social communities have emerged, and due to Facebook we're now stars in our own reality show. Social and advertising have essentially turned the modern internet into a mainstream TV-like experience. The internet is no longer only for the nerds. The kings of this era (Google and Facebook) are also the biggest players in the Deep Learning space, because they have the largest user bases and in-house apps which can benefit most from machine learning.

2010-2015: Deep Learning comes to the party
Spend more than a day in Silicon Valley and you'll hear the popular expression, "Software is eating the world." Rampant spreading of software was only possible once the internet (1990s) AND mobile devices (2000s) became essential parts of our lives. No longer do we physically mail floppy disks, and social media fuels any app that goes viral. What traditional software is missing (or has been missing up until now) is the ability to improve over time from everyday use. If that same software is able to connect to a large Deep Learning system and start improving, then we have a game-changer on our hands. This is already happening with online advertising, digital assistants like Siri, and smart auto-responders like Google's new email auto-reply feature.

The hierarchical award-winning "AlexNet" Deep Learning architecture

Visualized using MIT's Toolbox for Deep Learning Neuron Visualization

Massive hiring of deep learning experts by the leading tech companies has only begun, but we also should be on the lookout for new ventures built on top of Deep Learning, not just a revitalization of last decade's successes. On this front, keep a close look at the following Deep Learning Cloud Service upstarts: Richard Socher from MetaMind, Matthew Zeiler from Clarifai, and Carlos Guestrin from Dato.

2015-2020: Deep Learning Revitalizes Robotics
Recently it has been shown that Deep Learning can be used to help robots learn tasks involving movement, object manipulation, and decision making[6,7,8,9]. Before Deep Learning, lots of different pieces of robotic software and hardware would have to be developed independently and then hacked together for demo day. Today, you can use one of a handful of "Deep Learning for Robotics recipes" and start watching your robot learn the task you care about.

Robots Learns to Grasp using Deep Learning at Carnegie Mellon University.

Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours [6]

With their 2013 acquisition of Boston Dynamics (a hardware play), 2014 acquisition of DeepMind (a software play), and a serious autonomous car play, Google is definitely early to the Robotics party. But the noteworthy bits are happening at the intersection of deep learning and robotics. I suggest taking a closer look at the Robotics research of Pieter Abbeel of Berkeley, Abhinav Gupta of Carnegie Mellon, and Ashutosh Saxena of Stanford -- all likely stars in the next Deep Learning for Robotics race. As long as Rodney Brooks keeps creating innovative Robotics platforms like Baxter, my expectations for Robotics are off the charts.

Conclusion

Unlike in 1849, the Deep Learning Gold Rush of 2015 is not going to bring some 300,000 gold-seekers in boats to California's mainland. This isn't a bring-your-own-hammer kind of game -- the Titans have already descended from their Ivory Towers and handed us ample mining tools. But it won't hurt to gain some experience with traditional "shallow" machine learning techniques so you can appreciate the power of Deep Learning.

I hope you enjoyed today's read and have a better sense of how Silicon Valley is undergoing a transformation. And remember, today's wave of Deep Learning upstart CEOs have PhDs, but once Deep Learning software becomes more user-friendly (TensorFlow?), maybe you won't have to wait so long to dropout.

References

[1] Krizhevsky, A., Sutskever, I. and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS 2012.
[2] D'Onfro, J. Google is 're-thinking' all of its products to include machine learning. Business Insider. October 22, 2015.
[3] D'Onfro, J. How Facebook will use artificial intelligence to organize insane amounts of data into the perfect News Feed and a personal assistant with superpowers. Business Insider. November 3, 2015.
[4] Blank, S. Secret History of Silicon Valley. 2008.
[5] Donglai Wei, Bolei Zhou, Antonio Torralba William T. Freeman. mNeuron: A Matlab Plugin to Visualize Neurons from Deep Models. 2015.
[6] Lerrel Pinto, Abhinav Gupta. Supersizing Self-supervision: Learning to Graspfrom 50K Tries and 700 Robot Hours. arXiv. 2015.
[7] Sergey Levine, Chelsea Finn, Trevor Darrell, Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. In RSS 2015.
[8] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
[9] Ian Lenz, Ross Knepper, and Ashutosh Saxena. DeepMPC: Learning Deep Latent Features for Model Predictive Control. In Robotics Science and Systems (RSS), 2015

Friday, December 06, 2013

Brand Spankin' New Vision Papers from ICCV 2013

The International Conference of Computer Vision, ICCV, gathers the world's best researchers in Computer Vision and Machine Learning to showcase their newest and hottest ideas. (My work on the Exemplar-SVM debuted two years ago at ICCV 2011 in Barcelona.) This year, at ICCV 2013 in Sydney, Australia, the vision community witnessed lots of grand new ideas, excellent presentations, and gained new insights which are likely to influence the direction of vision research in the upcoming decade.

3D data is everywhere. Detectors are not only getting faster, but getting stylish. Edges are making a comeback. HOGgles let you see the world through the eyes of an algorithm. Computers can automatically make your face pictures more memorable. And why ever stop learning, when you can learn all day long?

Here is a breakdown of some of the must-read ICCV 2013 papers which I'd like to share with you:

From Large Scale Image Categorization to Entry-Level Categories, Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L. Berg, ICCV 2013.

This paper is the Marr Prize winning paper from this year's conference. It is all about entry-level categories - the labels people will use to name an object - which were originally defined and studied by psychologists in the 1980s. In the ICCV paper, the authors study entry-level categories at a large scale and learn the first models for predicting entry-level categories for images. The authors learn mappings between concepts predicted by existing visual recognition systems and entry-level concepts that could be useful for improving human-focused applications such as natural language image description or retrieval. NOTE: If you haven't read Eleanor Rosch's seminal 1978 paper, The Principles of Categorization, do yourself a favor: grab a tall coffee, read it and prepare to be rocked.

Structured Forests for Fast Edge Detection, P. Dollar and C. L. Zitnick, ICCV 2013.

This paper from Microsoft Research is all about pushing the boundaries for edge detection. Randomized Decision Trees and Forests have been used in lots of excellent Microsoft research papers, with Jamie Shotton's Kinect work being one of the best examples, and it is now being used for super high-speed edge detection. However this paper is not just about edges. Quoting the authors, "We describe a general purpose method for learning structured random decision forest that robustly uses structured labels to select splits in the trees." Anybody serious about learning for low-level vision should take a look.

There is also some code available, but take a very detailed look at the license before you use it in your project. It is not an MIT license.

HOGgles: Visualizing Object Detection Features, C. Vondrick, A. Khosla, T. Malisiewicz, A. Torralba. ICCV 2013.

"The real voyage of discovery consists not in seeking new landscapes but in having new eyes." — Marcel Proust

This is our MIT paper, which I already blogged about (Can you pass the HOGgles test?), so instead of rehashing what was already mentioned, I'll just leave you with the quote above. There are lots of great visualizations that Carl Vondrick put together on the HOGgles project webpage, so take a look.

Style-aware Mid-level Representation for Discovering Visual Connections in Space and Time, Yong Jae Lee, Alexei A. Efros, and Martial Hebert, ICCV 2013.

“Learn how to see. Realize that everything connects to everything else.” – Leonardo da Vinci

This paper is all about discovering how visual entities change as a function of time and space. One great example is how the appearance of cars has changed over the past several decades. Another example is how typical Google Street View images change as a function of going North-to-South in the United States. Surely the North looks different than the South -- we now have an algorithm that can automatically discover these precise differences.

By the way, congratulations on the move to Berkeley, Monsieur Efros. I hope your insatiable thirst for cultured life will not only be satisfied in the city which fostered your intellectual growth, but you will continue to inspire, educate, and motivate the next generation of visionaries.

NEIL: Extracting Visual Knowledge from Web Data. Xinlei Chen, Abhinav Shrivastava and Abhinav Gupta. In ICCV 2013. www.neil-kb.com

Fucking awesome! I don't normally use profanity in my blog, but I couldn't come up with a better phrase to describe the ideas presented in this paper. A computer program which runs 24/7 to collected visual data from the internet and continually learn what the world is all about. This is machine learning, this is AI, this is the future. None of this train on my favourite dataset, test on my favourite dataset bullshit. If there's anybody that's going to do it the right way, its the CMU gang. This paper gets my unofficial "Vision Award." Congratulations, Xinlei!

This sort of never-ending learning has been applied to text by Tom Mitchell's group (also from CMU), but this is the first, and serious, attempt at never-ending visual learning. The underlying algorithm is a semi-supervised learning algorithm which uses Google Image search to bootstrap the initial detectors, but eventually learns object-object relationships, object-attribute relationships, and scene-attribute relationships.

Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition. J. F. Henriques, J. Carreira, R. Caseiro, J. Batista. ICCV 2013.

Want faster detectors? Tired of hard-negative mining? Love all things Fourier? Then this paper is for you. Aren't you now glad you fell in love with linear algebra at a young age? This paper very clearly shows that there is a better way to perform hard-negative mining when the negatives are mined from translations of an underlying image pattern, as is typically done in object detection. The basic idea is simple, and that's why this paper wins the "thumbs-up from tombone" award. The crux of the derivation in the paper is the observation that the Gram matrix of a set of images and their translated versions, as modeled by cyclic shifts, exhibits a block-circulant structure. Instead of incrementally mining negatives, in this paper they show that it is possible to learn directly from a training set comprising all image subwindows of a predetermined aspect-ratio and show this is feasible for a rich set of popular models including Ridge Regression, Support Vector Regression (SVR) and Logistic Regression. Move over hard-negative mining, Joseph Fourier just rocked your world.

P.S. Joao Carreira also created the CPMC image segmentation algorithm at CVPR 2010. A recent blog post from Piotr Dollár (December 10th, 2013), "A Seismic Shift in Object Detection" discusses how segmentation is coming back into vision in a big way.

3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding, Scott Satkin and Martial Hebert. ICCV 2013.

A new way of matching images that come equipped with 3D data. Whether the data comes from Google Sketchup, or is the output of a Kinect-like scanner, more and more visual data comes with its own 3D interpretation. Unfortunately, most state-of-the-art image matching methods rely on comparing purely visual cues. This paper is based on an idea called "fine-grained geometry refinement" and allows the transfer of information across extreme viewpoint changes. While still computationally expensive, it allows non-parametric (i.e., data-driven) approaches to get away with using significantly smaller amounts of data.

Modifying the Memorability of Face Photographs. Aditya Khosla, Wilma A. Bainbridge, Antonio Torralba and Aude Oliva, ICCV 2013.

Ever wanted to look more memorable in your photos? Maybe your ad-campaign could benefit from better face pictures which are more likely to stick in people's minds. Well, now there's an algorithm for that. Another great MIT paper, which the authors show that the memorability of photographs could not only be measured, but automatically enhanced!

SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels. J. Xiao, A. Owens and A. Torralba. ICCV 2013. sun3d.cs.princeton.edu

Xiao et al, continue their hard-core data collection efforts. Now in 3D. In addition to collecting a vast dataset of 3D reconstructed scenes, they show that there are some kinds of errors that simply cannot be overcome with high-quality solvers. Some problems are too big and too ambitious (e.g., walking around an entire house with a Kinect) for even the best industrial-grade solvers (Google's Ceres solver) to tackle. In this paper, they show that a small amount of human annotation is all it takes to snap those reconstructions in place. And not any sort of crazy, click-here, click-there interfaces. Simple LabelMe-like annotation interfaces, which require annotating object polygons, can be used to create additional object-object constraints which help the solvers do their magic. For anybody interested in long-range scene reconstruction, take a look at their paper.

If there's one person I've ever seen that collects data while the rest of the world sleeps, it is definitely Prof. Xiao. Congratulations on the new faculty position! Princeton has been starving for a person like you. If anybody is looking for PhD/Masters/postdoc positions, and wants to work alongside one the most ambitious and driven upcoming researchers in vision (Prof. Xiao), take a look at his disclaimer/call for students/postdocs at Princeton, then apply to the program directly. Did I mention that you probably have to be a hacker/scientist badass to land a position in his lab?

Other noteworthy papers:

Mining Multiple Queries for Image Retrieval: On-the-fly learning of an Object-specific Mid-level Representation. B. Fernando, T. Tuytelaars, ICCV 2013.

Training Deformable Part Models with Decorrelated Features. R. Girshick, J. Malik, ICCV 2013.

Sorry if I missed your paper, there were just too many good ones to list. For those of you still in Sydney, be sure to either take a picture of a Kangaroo, or eat one.

Monday, October 28, 2013

Just Add Vision: Turning Computers Into Robots

The future of technology is all about improving the human experience. And the human experience is all about you -- you filling your life with less tedious work, more fun, less discomfort, and more meaningful human interactions. Whether new technology will let us enjoy life more during our spare time (think of what big screen TVs did for entertainment), or, let us become more productive at work (think of what calculators did for engineers), successful technologies have the tendency to improve our quality of life.

Let’s take a quick look at how things got started...

IBM started a chain of events by building affordable computers for small businesses to increase their productivity. Microsoft and Apple then created easy-to-use operating systems which allowed the common man to use computers at home for both entertainment (computer games) and being more productive (MS Office). Once personal computers started entering our homes, it was only a matter of a years until broadband internet access become widespread. Google then came along and changed the way we retrieve information from the internet while Social networking redefined how we interact with the people in our lives. Let's not forget modern smartphones, which let us use all of this amazing technology while on the go!

Surely our iPhones will get faster and smaller while Google search will become more robust, but does the way we interact with these devices have to stay the same? And will these devices always do the same things?

Computers without keyboards
A lot of the world’s most exciting technology is designed to be used directly by people and ceases to provide much value once we stop directly interacting with our devices. I honestly believe that instead of wearing more computing devices (such as Google Glass) and learning new iOS commands, what we need is technology that can do useful things on its own, without requiring a person to hit buttons or custom keyboards. Because doing useful things entails having some sort of computational unit inside, it is fair to think of these future devices as “computers.” However, making computers do useful things on their own requires making machines intelligent, something which is yet to reach the masses, so I think a better name for these devices is robots.

What is a robot?
If we want machines to help us out in our daily tasks (e.g., cleaning, cooking, driving, playing with us, teaching us) we need machines that can both perceive their immediate environment and act intelligently. The perception-and-action loop is all that is necessary in order to turn everyday computers into intelligent robots. While it would be “nice” to build humanoid robots which look like this:

In my opinion, a robot is any device capable of executing its own perception and action loop. Thus, it is not necessary to have full-fledged humanoid robots to start reaping the benefit of consumer-robotics in-home robotics. Once we stop looking for smart machines with legs, and broaden our definition of a robot, it is easy to tell that the revolution has already begun.

Current desktop computers and laptops, which require input in the form of a key being pressed or a movement on the trackpad, can be viewed as semi-intelligent machines -- but because the input interfaces render the perception problem unnecessary, I do not consider them full-fledged robots. However, an iPhone running Siri is capable of sending a text message to one of our contacts via speech, so to some extent I consider Siri-enabled iPhones as robots. Tasks such as cleaning cannot be easily automated using Siri because no matter how dirty a floor is, it will never exclaim, “I’m dirty, please clean me!”. What we need is the ability for our devices to see -- namely, recognize objects in the environment (is this a sofa or a chair?), infer their state (clean vs. dirty), and track their spatial extent in the environment (these pixels belong to the plate).

Just add vision
We have spent decades using keyboards and mice, essentially learning a machine-specific language between us and machines. Whether you consider keystrokes as a high-level or low-level language is besides the point -- it is still a language, and more specifically a language which requires inputting everything explicitly. If we want machines to effortlessly interact with the world, we need to teach them our language and let them perceive the world directly. With the current advancements in computer vision, this is becoming a reality. But the world needs more visionary thinkers to become computer vision experts, more vision experts to start caring about broader uses of their technology, more everyday programmers to use computer vision in their projects, and more expert-grade computer vision tools accessible to those just starting out. Only then, will we be able to pool our collective efforts and finally interweave in-home robotics with the everyday human experience.

What's next?
Wouldn’t it be great if we had a general-purpose machine vision API which would render the most tedious and time-consuming part of training object detectors obsolete? Wouldn't it be awesome if we could all use computer vision without becoming mathematics gurus or having years of software engineering experience? Well, this might be happening sooner than you think. In an upcoming blog post, I will describe what this API is going to look like and why it’s going to make your life a whole lot easier. I promise not to disappoint...

Friday, December 31, 2010

why I should be hacking with a kinect

It was recently brought to my attention that Alex Berg a.k.a. Alexander Berg is hacking with a Kinect.
In case you didn't know, Alex Berg is an assistant professor at Stony Brook University as of Sept 2010. He came out of Jitendra Malik's group, and can be thought of as my academic uncle (because he got his PhD with Jitendra at basically the same time as my advisor, Alyosha Efros). I am a big fan of Alex Berg's work. (See the paper at ECCV 2010: What does classifying more than 10,000 image categories tell us? and note his upcoming workshop "Large Scale Learning for Vision" at CVPR 2011).

I had already known that Xiaofeng Ren has been hacking with RGB-D cameras such as the Kinect for some time now. Xiaofeng (pronunciation of first name) Ren is a research scientist at Intel Labs Seattle since 2008 and on the affiliate faculty at the CSE department at UW since 2010. He is another one my many academic uncles and has contributed greatly to the field of Computer Vision. For some of his recent work with Kinects, see his RGB-D project page. Xiaofeng Ren's work has also been very influential during my own research -- it is worthwhile to recall that he coined the term "superpixels", which is prevalent in contemporary Computer Vision literature.

So when I learned that these bad-ass ex-Berkeley hackers are hacking with Kinects, I figured it was the time to acquire one of my own. I bought a Kinect today and plan on playing with Alex Berg's kinect2matlab interface for Mac OS X soon!

So, why aren't you hacking with a kinect?