Tuesday, January 07, 2014

Tracking points in a live camera feed: A behind-the-scenes look at the VMX Project webapp

At our computer vision startup, vision.ai, we're using open-source tools to create a one-of-a-kind object recognition experience.  Our goal is to make state-of-the-art visual object recognition as easy as waving an object in front of your laptop's or smartphone's camera.  We've made a webapp and programming environment called VMX that allows you to teach your computer about objects without any advanced programming or bulky software installations -- you'll finally be able to put your computer's new visual reasoning abilities to good use.  Today's blog post is about some of the underlying technology that we used to build the VMX prototype.  (To learn about the entire project and how you can help, please visit VMX Project on Kickstarter.)

The VMX project uses many different programming languages and technologies.  Many of the behind-the-scenes machine learning algorithms have been developed in our lab, but it takes more than robust back-end algorithms to make a good product.  On the front end, the two key open-source (MIT-licensed) projects we rely on are AngularJS and JSFeat.  AngularJS is an open-source JavaScript framework, maintained by Google, that assists with running single-page applications.  Today's focus will be on JSFeat, the JavaScript Computer Vision Library we use inside the front-end webapp.  What is JSFeat?  Quoting Eugene Zatepyakin, the author of JSFeat, "The project aim is to explore JS/HTML5 possibilities using modern & state-of-art computer vision algorithms."

We use the JSFeat library to track points inside the video stream.  Below is a YouTube video of our webapp in action, where we enabled the "debug display" to show you what is happening to tracked points behind the scenes.  The blue points are being tracked inside the browser, the green box is the output of our object detection service (already trained on my face), and the black box is the interpolated result which integrates the backend service and the frontend tracker.
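
For readers curious how a tracker and a detector might be combined into one box, here is a minimal Python sketch.  It is an illustration under our own assumptions, not the method used in VMX: the function names, the median-displacement rule, and the `detector_weight` parameter are all hypothetical.  The idea is to move the last known box by the median motion of the tracked points that started inside it, then blend the result with the latest detector box.

```python
from statistics import median

def shift_box_by_tracks(box, old_points, new_points):
    """Move box (x, y, w, h) by the median displacement of the points
    that started inside it. Hypothetical helper, for illustration only."""
    x, y, w, h = box
    dxs, dys = [], []
    for (ox, oy), (nx, ny) in zip(old_points, new_points):
        if x <= ox <= x + w and y <= oy <= y + h:
            dxs.append(nx - ox)
            dys.append(ny - oy)
    if not dxs:
        # No tracked point started inside the box, so leave it where it was.
        return box
    return (x + median(dxs), y + median(dys), w, h)

def blend_boxes(tracked_box, detected_box, detector_weight=0.5):
    """Linear blend of two (x, y, w, h) boxes. The weight is an assumed
    tuning knob: 0.0 trusts the tracker, 1.0 trusts the detector."""
    return tuple((1.0 - detector_weight) * t + detector_weight * d
                 for t, d in zip(tracked_box, detected_box))

# Usage: three points inside the box move by (3, 2); one outside stays put.
old_pts = [(110, 110), (120, 130), (140, 120), (300, 300)]
new_pts = [(113, 112), (123, 132), (143, 122), (300, 300)]
tracked = shift_box_by_tracks((100, 100, 50, 50), old_pts, new_pts)
combined = blend_boxes(tracked, (105, 104, 50, 50))
```

Using the median rather than the mean keeps a few badly tracked points from dragging the box away; that is a design choice of this sketch, not a claim about the production system.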

The tracker calculates optical flow for a sparse feature set using the iterative Lucas-Kanade method with pyramids.  The algorithm looks at two consecutive video frames and, for each point, estimates how a small window around that point moved by solving a straightforward least-squares problem.  The Lucas-Kanade algorithm is a classic in the computer vision community -- to learn more, see the Lucas-Kanade Wikipedia page or take a graduate-level computer vision course.  Alternatively, if you find me on the street and ask nicely, I might give you an impromptu lecture on optical flow.
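
To make the least-squares step concrete, here is a small Python sketch of a single Lucas-Kanade update for one point.  It is deliberately simplified: one pyramid level and one iteration, whereas the tracker described above is iterative and pyramidal.  The synthetic `make_frame` image and the `half_window` parameter are assumptions made for this example and are not part of JSFeat.

```python
import math

def make_frame(width, height, dx=0.0, dy=0.0):
    """Synthetic smooth image; (dx, dy) shifts the pattern (test data only)."""
    return [[math.sin(0.3 * (x - dx)) + math.cos(0.2 * (y - dy))
             for x in range(width)] for y in range(height)]

def lucas_kanade_step(prev, curr, px, py, half_window=4):
    """Estimate the motion (u, v) of point (px, py) between two frames.

    Solves the 2x2 normal equations
        [sxx sxy] [u]     [sxt]
        [sxy syy] [v] = - [syt]
    built from image gradients inside a square window.
    Returns None when the window has too little texture to solve.
    The window must lie at least one pixel inside the image border.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(py - half_window, py + half_window + 1):
        for x in range(px - half_window, px + half_window + 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # horizontal gradient
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # vertical gradient
            it = curr[y][x] - prev[y][x]                  # temporal difference
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-9:
        return None
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v

# Usage: shift the pattern by (0.5, -0.3) pixels and recover the motion.
frame_a = make_frame(40, 40)
frame_b = make_frame(40, 40, dx=0.5, dy=-0.3)
flow = lucas_kanade_step(frame_a, frame_b, 20, 20)
```

For small shifts like this one, the estimate should land near the true motion.  Larger motions are the reason real trackers repeat the step and work coarse-to-fine over an image pyramid.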

Instead of using interest points, in our prototype video we used a regularly spaced grid of points covering the entire video stream.  This grid gets re-initialized every N seconds.  Using a fixed grid avoids the extra expense of finding interest points inside every frame.  NOTE: inside our vision.ai computer vision lab, we are incessantly experimenting with better ways of integrating point tracks with strong object detector results.  What you're seeing is just an early snapshot of the technology in action.
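
The grid-and-reset idea can be sketched in a few lines of Python.  This is an illustration only: the `GridTracker` class, the `spacing` parameter, and the `reset_seconds` name (standing in for the N above) are hypothetical, and the tracked positions are passed in rather than computed.

```python
def make_grid(width, height, spacing):
    """Regularly spaced (x, y) points covering a width x height frame."""
    offset = spacing // 2
    return [(x, y)
            for y in range(offset, height, spacing)
            for x in range(offset, width, spacing)]

class GridTracker:
    """Holds the current points and resets them to the grid periodically."""

    def __init__(self, width, height, spacing, reset_seconds):
        self.width = width
        self.height = height
        self.spacing = spacing
        self.reset_seconds = reset_seconds
        self.points = make_grid(width, height, spacing)
        self.last_reset = 0.0

    def update(self, now, moved_points):
        """Keep the tracked positions, or start over from a fresh grid
        once reset_seconds have passed since the last reset."""
        if now - self.last_reset >= self.reset_seconds:
            self.points = make_grid(self.width, self.height, self.spacing)
            self.last_reset = now
        else:
            self.points = moved_points
        return self.points

# Usage: a 640x480 stream with a point every 40 pixels, reset every 2 seconds.
tracker = GridTracker(640, 480, spacing=40, reset_seconds=2.0)
initial_count = len(tracker.points)
drifted = [(x + 1, y) for (x, y) in tracker.points]
after_half_second = tracker.update(0.5, drifted)
after_two_seconds = tracker.update(2.0, drifted)
```

Resetting matters because tracked points drift, pile up on the same textured regions, or get lost over time; a fresh grid restores even coverage without running an interest point detector.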

To play with a Lucas-Kanade tracker, take a look at the JSFeat demo page which runs a point tracker directly inside your browser.  You'll have to click on points, one at a time.  You'll need Google Chrome or Firefox (just like our VMX project), and this will give you a good sense of what using VMX is going to be like once it is available.

To summarize, there are lots of great computer vision tools out there, but none of these tools can give you a comprehensive object recognition system that requires little to no programming experience.  There is a lot of work needed to put together appropriate machine learning algorithms, object detection libraries, web services, trackers, video codecs, etc.  Luckily, the team at vision.ai loves both code and machine learning.  In addition, having spent the last 10 years of my life working as a researcher in Computer Vision doesn't hurt.

Getting a PhD in Computer Vision and learning how all of these technologies work is a truly amazing experience.  I encourage many students to undertake this 6+ year journey and learn all about computer vision.  But I know the PhD path is not for everybody.  That's why we've built VMX: so the rest of you can enjoy the power of industrial-grade computer vision algorithms and the ease of intuitive web-based interfaces, without the expertise needed to piece together many different technologies.  The number of applications of computer vision tech is astounding, and it is a shame that such technology hasn't been delivered with such a low barrier to entry earlier.

With VMX, we're excited that the world is going to experience visual object recognition the way it was meant to be experienced.  But for that to happen, we still need your support.  Check out our VMX Project on Kickstarter (the page has lots of additional VMX-in-action videos), and help spread the word.


  1. I would like to show it an outside photo and ask where it was taken. I can listen to music and identify the CD [much easier, of course].

  2. Hi Harlow,
    That is not an easy task, but here is a link to some work I did a few years back with colleagues at CMU on this image matching task: