Monday, October 28, 2013

Just Add Vision: Turning Computers Into Robots

The future of technology is all about improving the human experience. And the human experience is all about you -- you filling your life with less tedious work, more fun, less discomfort, and more meaningful human interactions. Whether new technology lets us enjoy life more in our spare time (think of what big-screen TVs did for entertainment) or makes us more productive at work (think of what calculators did for engineers), successful technologies tend to improve our quality of life.

Let’s take a quick look at how things got started... 

IBM started a chain of events by building affordable computers that let small businesses increase their productivity. Microsoft and Apple then created easy-to-use operating systems which allowed the common man to use computers at home, both for entertainment (computer games) and for productivity (MS Office). Once personal computers started entering our homes, it was only a matter of years until broadband internet access became widespread. Google then came along and changed the way we retrieve information from the internet, while social networking redefined how we interact with the people in our lives. Let's not forget modern smartphones, which let us use all of this amazing technology while on the go!

Surely our iPhones will get faster and smaller while Google search will become more robust, but does the way we interact with these devices have to stay the same? And will these devices always do the same things? 

Computers without keyboards 
A lot of the world's most exciting technology is designed to be used directly by people, and it ceases to provide much value once we stop interacting with our devices. I honestly believe that instead of wearing more computing devices (such as Google Glass) and learning new iOS commands, what we need is technology that can do useful things on its own, without requiring a person to hit buttons or tap custom keyboards. Because doing useful things entails having some sort of computational unit inside, it is fair to think of these future devices as "computers." However, making computers do useful things on their own requires making machines intelligent, something which has yet to reach the masses, so I think a better name for these devices is robots.

What is a robot? 
If we want machines to help us out in our daily tasks (e.g., cleaning, cooking, driving, playing with us, teaching us), we need machines that can both perceive their immediate environment and act intelligently. This perception-and-action loop is all that is necessary to turn everyday computers into intelligent robots.

While it would be "nice" to build full-fledged humanoid robots, in my opinion a robot is any device capable of executing its own perception-and-action loop. Thus, humanoid hardware is not necessary to start reaping the benefits of in-home robotics. Once we stop looking for smart machines with legs and broaden our definition of a robot, it is easy to tell that the revolution has already begun.
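To make the perception-and-action loop concrete, here is a minimal sketch in Python using OpenCV's webcam interface. The perceive and act steps are toy placeholders of my own invention, not part of any real robotics framework:

    import cv2  # OpenCV's Python bindings


    def perceive(frame):
        """Toy perception step: summarize the scene as a brightness value."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        return gray.mean()


    def act(brightness):
        """Toy action step: a real robot would drive motors; we just report."""
        if brightness < 50:
            print("Scene looks dark -- a robot might switch on a light here.")


    camera = cv2.VideoCapture(0)  # the default webcam
    while camera.isOpened():
        ok, frame = camera.read()
        if not ok:
            break
        act(perceive(frame))  # one pass through the perception-and-action loop
    camera.release()

Swap in smarter perceive and act functions and the skeleton stays the same; that single loop is the whole definition.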

Current desktop computers and laptops, which require input in the form of a key being pressed or a movement on the trackpad, can be viewed as semi-intelligent machines -- but because these input interfaces render the perception problem unnecessary, I do not consider them full-fledged robots. However, an iPhone running Siri is capable of sending a text message to one of our contacts via speech, so to some extent I consider Siri-enabled iPhones robots. Tasks such as cleaning cannot be easily automated using Siri, because no matter how dirty a floor is, it will never exclaim, "I'm dirty, please clean me!" What we need is the ability for our devices to see -- namely, to recognize objects in the environment (is this a sofa or a chair?), infer their state (clean vs. dirty), and track their spatial extent (these pixels belong to the plate).
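To give a flavor of what "seeing" means in code, here is a small sketch using OpenCV's stock Haar-cascade face detector. Faces stand in for sofas and plates here, the input filename is a made-up example, and the cascade path assumes the data files bundled with the opencv-python package:

    import cv2

    # Stock face detector shipped with opencv-python (the path is an
    # assumption about your install; adjust if the file lives elsewhere).
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    image = cv2.imread("room.jpg")  # hypothetical input photo
    assert image is not None, "expected a photo named room.jpg"
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Each detection is a bounding box: the machine's claim that
    # "these pixels belong to a face."
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in boxes:
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

    cv2.imwrite("room_detections.jpg", image)

Recognizing the object is only the first step; inferring its state (clean vs. dirty) and tracking its pixels over time are the harder problems standing between today's demos and genuinely useful in-home robots.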

Just add vision
We have spent decades using keyboards and mice, essentially learning a machine-specific language for talking to our machines. Whether you consider keystrokes a high-level or low-level language is beside the point -- it is still a language, and more specifically a language which requires inputting everything explicitly. If we want machines to effortlessly interact with the world, we need to teach them our language and let them perceive the world directly. With current advances in computer vision, this is becoming a reality. But the world needs more visionary thinkers to become computer vision experts, more vision experts to start caring about broader uses of their technology, more everyday programmers to use computer vision in their projects, and more expert-grade computer vision tools accessible to those just starting out. Only then will we be able to pool our collective efforts and finally interweave in-home robotics with the everyday human experience.

What's next?
Wouldn't it be great if we had a general-purpose machine vision API which would render the most tedious and time-consuming parts of training object detectors obsolete? Wouldn't it be awesome if we could all use computer vision without becoming mathematics gurus or having years of software engineering experience? Well, this might be happening sooner than you think. In an upcoming blog post, I will describe what this API is going to look like and why it's going to make your life a whole lot easier. I promise not to disappoint...