Friday, April 24, 2015

Making Visual Data a First-Class Citizen

Above all, don't lie to yourself. The man who lies to himself and listens to his own lie comes to a point that he cannot distinguish the truth within him, or around him, and so loses all respect for himself and for others. And having no respect he ceases to love.” ― Fyodor Dostoyevsky, The Brothers Karamazov


City Forensics: Using Visual Elements to Predict Non-Visual City Attributes

To respect the power and beauty of machine learning algorithms, especially when they are applied to the visual world, let's take a look at three recent applications of learning-based "computer vision" to computer graphics. Researchers in computer graphics are known for producing truly captivating illustrations of their results, so this post is going to be very visual. Now is your chance to sit back and let the pictures do the talking.

Can you predict things simply by looking at street-view images?

Let's say you're going to visit an old-friend in a foreign country for the first time. You've never visited this country before and have no idea what kind of city/neighborhood your friend lives in. So you decide to get a sneak peak -- you enter your friend's address into Google Street View.

Most people can look at Google Street View images in a given location and estimate attributes such as "sketchy," "rural," "slum-like," "noisy" for the given neighborhood. TLDR; A person is a pretty good visual recommendation engine.

Can you predict if this looks like a safe location? 
(Screenshot of Street view for Manizales, Colombia on Google Earth)

Can a computer program predict things by looking at images? If so, then these kinds of computer programs could be used to automatically generate semantic map layovers (see the crime prediction overlay from the first figure), help organize fast-growing cities (computer vision meets urban planning?), and ultimately bring about a new generation of match-making "visual recommendation engines" (a whole suite of new startups).

Before I discuss the research paper behind this idea, here are two cool things you could do (in theory) with a non-visual data prediction algorithm. There are plenty of great product ideas in this space -- just be creative.

Startup Idea #1: Avoiding sketchy areas when traveling abroad 
A Personalized location recommendation engine could be used to find locations in a city that I might find interesting (techie coffee shop for entrepreneurs, a park good for frisbee) subject to my constraints (near my current location, in a low-danger area, low traffic).  Below is the kind of place you want to avoid if you're looking for a coffee and a place to open up your laptop to do some work.

Google Street Maps, Morumbi São Paulo: slum housing (image from geographyfieldwork.com)

Startup Idea #2: Apartment Pricing and Marketing from Images
Visual recommendation engines could be used to predict the best images to represent an apartment for an Airbnb listing.  It would be great if Airbnb had a filter that would let you upload videos of your apartment, and it would predict that set of static images that best depict your apartment to maximize earning potential. I'm sure that Airbnb users would pay extra for this feature if it was available for a small extra charge. The same computer vision prediction idea can be applied to home pricing on Zillow, Craigslist, and anywhere else that pictures of for-sale items are shared.

Google image search result for "Good looking apartment". Can computer vision be used to automatically select pictures that will make your apartment listing successful on Airbnb?

Part I. City Forensics: Using Visual Elements to Predict Non-Visual City Attributes


The Berkeley Computer Graphics Group has been working on predicting non-visual attributes from images, so before I describe their approach, let me discuss how Berkeley's Visual Elements relate to Deep Learning.

Predicting Chicago Thefts from San Francisco data. Predicting Philadelphia Housing Prices from Boston data. From City Forensics paper.



Deep Learning vs Mid-level Patch Discovery (Technical Discussion)
You might think that non-visual data prediction from images (if even possible) will require a deep understanding of the image and thus these approaches must be based on a recent ConvNet deep learning method. Obviously, knowing the locations and categories associated with each object in a scene could benefit any computer vision algorithm.  The problem is that such general purpose CNN recognition systems aren't powerful enough to parse Google Street View images, at least not yet.

Another extreme is to train classifiers on entire images.  This was initially done when researchers were using GIST, but there are just too many nuisance pixels inside a typical image, so it is better to focus your machine learning a subset of the image.  But how do you choose the subset of the image to focus on?

There exist computer vision algorithms that can mine a large dataset of images and automatically extract meaningful, repeatable, and detectable mid-level visual patterns. These methods are not label-based and work really well when there is an underlying theme tying together a collection of images. The set of all Google Street View Images from Paris satisfies this criterion.  Large collections of random images from the internet must be labeled before they can be used to produce the kind of stellar results we all expect out of deep learning. The Berkeley Group uses visual elements automatically mined from images as the core representation.  Mid-level visual patterns are simply chunks of the image which correspond to repeatable configurations -- they sometimes contain entire objects, parts of objects, and popular multiple object configurations. (See Figure below)  The mid-level visual patterns form a visual dictionary which can be used to represent the set of images. Different sets of images (e.g., images from two different US cities) will have different mid-level dictionaries. These dictionaries are similar to "Visual Words" but their creation uses more SVM-like machinery.

The patch mining algorithm is known as mid-level patch discovery. You can think of mid-level patch discovery as a visually intelligent K-means clustering algorithm, but for really really large datasets. Here's a figure from the ECCV 2012 paper which introduced mid-level discriminative patches.

Unsupervised Discovery of Mid-Level Discriminative Patches

Unsupervised Discovery of Mid-Level Discriminative Patches. Saurabh Singh, Abhinav Gupta and Alexei A. Efros. In European Conference on Computer Vision (2012).

I should also point out that non-final layers in a pre-trained CNN could also be used for representing images, without the need to use a descriptor such as HOG. I would expect the performance to improve, so the questions is perhaps: How long until somebody publishes an awesome unsupervised CNN-based patch discovery algorithm? I'm a handful of researchers are already working on it. :-)

Related Blog Post: From feature descriptors to deep learning: 20 years of computer vision
The City Forensics paper from Berkeley tries to map the visual appearance of cities (as obtained from Google Street View Images) to non-visual data like crime statistics, housing prices and population density.  The basic idea is to 1.) mine discriminative patches from images and 2.) train a predictor which can map these visual primitives to non-visual data. While the underlying technique is that of mid-level patch discovery combined with Support Vector Regression (SVR), the result is an attribute-specific distribution over GPS coordinates.  Such a distribution should be appreciated for its own aesthetic value. I personally love custom data overlays.

City Forensics: Using Visual Elements to Predict Non-Visual City AttributesSean Arietta, Alexei A. Efros, Ravi Ramamoorthi, Maneesh Agrawala. In IEEE Transactions on Visualization and Computer Graphics (TVCG), 2014.


Part II. The Selfie 2.0: Computer Vision as a Sidekick


Sometimes you just want the algorithm to be your sidekick. Let's talk about a new and improved method for using vision algorithms and the wisdom of the crowds to select better pictures of your face. While you might think of an improved selfie as a silly application, you do want to look "professional" in your professional photos, sexy in your "selfies" and "friendly" in your family pictures. An algorithm that helps you get the desired picture is an algorithm the whole world can get behind.



Attractiveness versus Time. From MirrorMirror Paper.

The basic idea is to collect a large video of a single person which spans different emotions, times of day, different days, or whatever condition you would like to vary.  Given this video, you can use crowdsourcing to label frames based on a property like attractiveness or seriousness.  Given these labeled frames, you can then train a standard HOG detector and predict one of these attributes on new data. Below if a figure which shows the 10 best shots of the child (lots of smiling and eye contact) and the worst 10 shots (bad lighting, blur, red-eye, no eye contact).


10 good shots, 10 worst shots. From MirrorMirror Paper.

You can also collect a video of yourself as you go through a sequence of different emotions, get people to label frames, and build a system which can predict an attribute such as "seriousness".

Faces ranked from Most serious to least serious. From MirrorMirror Paper.


In this work, labeling was necessary for taking better selfies.  But if half of the world is taking pictures, while the other half is voting pictures up and down (or Tinder-style swiping left and right), then I think the data collection and data labeling effort won't be a big issue in years to come. Nevertheless, this is a cool way of scoring your photos. Regarding consumer applications, this is something that Google, Snapchat, and Facebook will probably integrate into their products very soon.

Mirror Mirror: Crowdsourcing Better Portraits. Jun-Yan Zhu, Aseem Agarwala, Alexei A. Efros, Eli Shechtman and Jue Wang. In ACM Transactions on Graphics (SIGGRAPH Asia), 2014.

Part III. What does it all mean? I'm ready for the cat pictures.


This final section revisits an old, simple, and powerful trick in computer vision and graphics. If you know how to compute the average of a sequence of numbers, then you'll have no problem understanding what an average image (or "mean image") is all about. And if you're read this far, don't worry, the cat picture is coming soon.

Computing average images (or "mean" images) is one of those tricks that I was introduced to very soon after I started working at CMU.  Antonio Torralba, who has always had "a few more visualization tricks" up his sleeve, started computing average images (in the early 2000s) to analyze scenes as well as datasets collected as part of the LabelMe project at MIT. There's really nothing more to the basic idea beyond simply averaging a bunch of pictures.

Teaser Image from AverageExplorer paper.

Usually this kind of averaging is done informally in research, to make some throwaway graphic, or make cool web-ready renderings.  It's great seeing an entire paper dedicated to a system which explores the concept of averaging even further. It took about 15 years of use until somebody was bold enough to write a paper about it. When you perform a little bit of alignment, the mean pictures look really awesome. Check out these cats!



Aligned cat images from the AverageExplorer paper. 
I want one! (Both the algorithm and a Platonic cat)

The AverageExplorer paper extends simple image average with some new tricks which make the operations much more effective. I won't say much about the paper (the link is below), just take at a peek at some of the coolest mean cats I've ever seen (visualized above) or a jaw-dropping way to look at community collected landmark photos (Oxford bridge mean image visualized below).

Aligned bridges from AverageExplorer paper. 
I wish Google would make all of Street View look like this.


Averaging images is a really powerful idea.  Want to know what your magical classifier is tuned to detect?  Compute the top detections and average them.  Soon enough you'll have a good idea of what's going on behind the scenes.

Conclusion


Allow me to mention the mastermind that helped bring most of these vision+graphics+learning applications to life.  There's an inimitable charm present in all of the works of Prof. Alyosha Efros -- a certain aesthetic that is missing from 2015's overly empirical zeitgeist.  He used to be at CMU, but recently moved back to Berkeley.

Being able to summarize several of years worth of research into a single computer generated graphic can go a long way to making your work memorable and inspirational. And maybe our lives don't need that much automation.  Maybe general purpose object recognition is too much? Maybe all we need is a little art? I want to leave you with a YouTube video from a recent 2015 lecture by Professor A.A. Efros titled "Making Visual Data a First-Class Citizen." If you want to hear the story in the master's own words, grab a drink and enjoy the lecture.

"Visual data is the biggest Big Data there is (Cisco projects that it will soon account for over 90% of internet traffic), but currently, the main way we can access it is via associated keywords. I will talk about some efforts towards indexing, retrieving, and mining visual data directly, without the use of keywords." ― A.A. Efros, Making Visual Data a First-Class Citizen



Wednesday, April 08, 2015

Deep Learning vs Probabilistic Graphical Models vs Logic

Today, let's take a look at three paradigms that have shaped the field of Artificial Intelligence in the last 50 years: Logic, Probabilistic Methods, and Deep Learning. The empirical, "data-driven", or big-data / deep-learning ideology triumphs today, but that wasn't always the case. Some of the earliest approaches to AI were based on Logic, and the transition from logic to data-driven methods has been heavily influenced by probabilistic thinking, something we will be investigating in this blog post.

Let's take a look back Logic and Probabilistic Graphical Models and make some predictions on where the field of AI and Machine Learning is likely to go in the near future. We will proceed in chronological order.

Image from Coursera's Probabilistic Graphical Models course

1. Logic and Algorithms (Common-sense "Thinking" Machines)


A lot of early work on Artificial Intelligence was concerned with Logic, Automated Theorem Proving, and manipulating symbols. It should not be a surprise that John McCarthy's seminal 1959 paper on AI had the title "Programs with common sense."

If we peek inside one of most popular AI textbooks, namely "Artificial Intelligence: A Modern Approach," we immediately notice that the beginning of the book is devoted to search, constraint satisfaction problems, first-order logic, and planning. The third edition's cover (pictured below) looks like a big chess board (because being good at chess used to be a sign of human intelligence), features a picture of Alan Turing (the father of computing theory) as well as a picture of Aristotle (one of the greatest classical philosophers which had quite a lot to say about intelligence).

The cover of AIMA, the canonical AI text for undergraduate CS students

Unfortunately, logic-based AI brushes the perception problem under the rug, and I've argued quite some time ago that understanding how perception works is really the key to unlocking the secrets of intelligence. Perception is one of those things which is easy for humans and immensely difficult for machines. (To read more see my 2011 blog post, Computer Vision is Artificial Intelligence). Logic is pure and traditional chess-playing bots are very algorithmic and search-y, but the real world is ugly, dirty, and ridden with uncertainty.

I think most contemporary AI researchers agree that Logic-based AI is dead. The kind of world where everything can be perfectly observed, a world with no measurement error, is not the world of robotics and big-data.  We live in the era of machine learning, and numerical techniques triumph over first-order logic.  As of 2015, I pity the fool who prefers Modus Ponens over Gradient Descent.

Logic is great for the classroom and I suspect that once enough perception problems become "essentially solved" that we will see a resurgence in Logic.  And while there will be plenty of open perception problems in the future, there will be scenarios where the community can stop worrying about perception and start revisiting these classical ideas. Perhaps in 2020.

Further reading: Logic and Artificial Intelligence from the Stanford Encyclopedia of Philosophy

2. Probability, Statistics, and Graphical Models ("Measuring" Machines)


Probabilistic methods in Artificial Intelligence came out of the need to deal with uncertainty. The middle part of the Artificial Intelligence a Modern Approach textbook is called "Uncertain Knowledge and Reasoning" and is a great introduction to these methods.  If you're picking up AIMA for the first time, I recommend you start with this section. And if you're a student starting out with AI, do yourself a favor and don't skimp on the math.

Intro to PDFs from Penn State's Probability Theory and Mathematical Statistics course

When most people think about probabilistic methods they think of counting.  In laymen's terms, it's fair to think of probabilistic methods as fancy counting methods.  Let's briefly take a look at what used to be the two competing methods for thinking probabilistically.

Frequentist methods are very empirical -- these methods are data-driven and make inferences purely from data.  Bayesian methods are more sophisticated and combine data-driven likelihoods with magical priors.  These priors often come from first principles or "intuitions" and the Bayesian approach is great for combining heuristics with data to make cleverer algorithms -- a nice mix of the rationalist and empiricist world views.

What is perhaps more exciting that then Frequentist vs. Bayesian flamewar is something known as Probabilistic Graphical Models.  This class of techniques comes from computer science, and even though Machine Learning is now a strong component of a CS and a Statistics degree, the true power of statistics only comes when it is married with computation.

Probabilistic Graphical Models are a marriage of Graph Theory with Probabilistic Methods and they were all the rage among Machine Learning researchers in the mid-2000s. Variational methods, Gibbs Sampling, and Belief Propagation were being pounded into the brains of CMU graduate students when I was in graduate school (2005-2011) and provided us with a superb mental framework for thinking about machine learning problems. I learned most of what I know about Graphical Models from Carlos Guestrin and Jonathan Huang. Carlos Guestrin is now the CEO of GraphLab, Inc (now known as Dato) which is a company that builds large-scale products for machine learning on graphs and Jonathan Huang is a senior research scientist at Google.

The video below is a high-level overview of GraphLab, but it serves a very nice overview of "graphical thinking" and how it fits into the modern data scientist's tool-belt. Carlos is an excellent lecturer and his presentation is less about the company's product and more about ways of thinking about next-generation machine learning systems.

A Computational Introduction to Probabilistic Graphical Models
by GraphLab, Inc CEO Prof. Carlos Guestrin (Video Link updated 4/17/2018)

If you think that deep learning is going to solve all of your machine learning problems, you should really take a look at the above video.  If you're building recommender systems, an analytics platform for healthcare data, designing a new trading algorithm, or building the next generation search engine, Graphical Models are the perfect place to start.

Further reading:
Belief Propagation Algorithm Wikipedia Page
An Introduction to Variational Methods for Graphical Models by Michael Jordan et al.
Michael Jordan's webpage (one of the titans of inference and graphical models)

3. Deep Learning and Machine Learning (Data-Driven Machines)

Machine Learning is about learning from examples and today's state-of-the-art recognition techniques require a lot of training data, a deep neural network, and patience. Deep Learning emphasizes the network architecture of today's most successful machine learning approaches.  These methods are based on "deep" multi-layer neural networks with many hidden layers. NOTE: I'd like to emphasize that using deep architectures (as of 2015) is not new.  Just check out the following "deep" architecture from 1998.

LeNet-5 Figure From Yann LeCun's seminal "Gradient-based learning
applied to document recognition" paper.

When you take a look at modern guide about LeNet, it comes with the following disclaimer:

"To run this example on a GPU, you need a good GPU. It needs at least 1GB of GPU RAM. More may be required if your monitor is connected to the GPU.

When the GPU is connected to the monitor, there is a limit of a few seconds for each GPU function call. This is needed as current GPUs can’t be used for the monitor while performing computations. Without this limit, the screen would freeze for too long and make it look as if the computer froze. This example hits this limit with medium-quality GPUs. When the GPU isn’t connected to a monitor, there is no time limit. You can lower the batch size to fix the timeout problem."

It really makes me wonder how Yann was able to get anything out of his deep model back in 1998. Perhaps it's not surprising that it took another decade for the rest of us to get the memo.

UPDATE: Yann pointed out (via a Facebook comment) that the ConvNet work dates back to 1989. "It had about 400K connections and took about 3 weeks to train on the USPS dataset (8000 training examples) on a SUN4 machine." -- LeCun



NOTE: At roughly the same time (~1998) two crazy guys in California were trying to cache the entire internet inside the computers in their garage (they started some funny-sounding company which starts with a G). I don't know how they did it, but I guess sometimes to win big you have to do things that don't scale. Eventually, the world will catch up.

Further reading:
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognitionProceedings of the IEEE, November 1998.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, Winter 1989

Deep Learning code: Modern LeNet implementation in Theano and docs.


Conclusion

I don't see traditional first-order logic making a comeback anytime soon. And while there is a lot of hype behind deep learning, distributed systems and "graphical thinking" is likely to make a much more profound impact on data science than heavily optimized CNNs. There is no reason why deep learning can't be combined with a GraphLab-style architecture, and some of the new exciting machine learning work in the next decade is likely to be a marriage of these two philosophies.


You can also check out a relevant post from last month:
Deep Learning vs Machine Learning vs Pattern Recognition

Discuss on Hacker News

Saturday, April 04, 2015

Three Fundamental Dimensions for Thinking About Machine Learning Systems

Today, let's set cutting-edge machine learning and computer vision techniques aside. You probably already know that computer vision (or "machine vision") is the branch of computer science / artificial intelligence concerned with recognizing objects like cars, faces, and hand gestures in images. And you also probably know that Machine Learning algorithms are used to drive state-of-the-art computer vision systems. But what's missing is a birds-eye view of how to think about designing new learning-based systems. So instead of focusing on today's trendiest machine learning techniques, let's go all the way back to day 1 and build ourselves a strong foundation for thinking about machine learning and computer vision systems.




Allow me to introduce three fundamental dimensions which you can follow to obtain computer vision masterdom. The first dimension is mathematical, the second is verbal, and the third is intuitive.

On a personal level, most of my daily computer vision activities directly map onto these dimensions. When I'm at a coffee shop, I prefer the mathematical - pen and paper are my weapons of choice. When it's time to get ideas out of my head, there's nothing like a solid founder-founder face-to-face meeting, an occasional MIT visit to brainstorm with my scientist colleagues, or simply rubberducking (rubber duck debugging) with developers. And when it comes to engineering, interacting with a live learning system can help develop the intuition necessary to make a system more powerful, more efficient, and ultimately much more robust.

Mathematical: Learn to love the linear classifier

At the core of machine learning is mathematics, so you shouldn't be surprised that I include mathematical as one of the three fundamental dimensions of thinking about computer vision.

The single most important concept in all of machine learning which you should master is the idea of the classifier. For some of you, classification is a well-understood problem; however, too many students prematurely jump into more complex algorithms line randomized decision forests and multi-layer neural networks, without first grokking the power of the linear classifier. Plenty of data scientists will agree that the linear classifier is the most fundamental machine learning algorithm. In fact, when Peter Norvig, Director of Research at Google, was asked "Which AI field has surpassed your expectations and surprised you the most?" in his 2010 interview, he answered with "machine learning by linear separators." 

The illustration below depicts a linear classifier. In two dimensions, a linear classifier is a line which separates the positive examples from the negative examples.  You should first master the 2D linear classifier, even though in most applications you'll need to explore a higher-dimensional feature space. My personal favorite learning algorithm is the linear support vector machine, or linear SVM. In a SVM, overly-confident data points do not influence the decision boundary. Or put in another way, learning with these confident points is like they aren't even there! This is a very useful property for large-scale learning problems where you can't fit all data into memory. You're going to want to master the linear SVM (and how it relates to Linear Discriminant Analysis, Linear Regression, and Logistic Regression) if you're going to pass one of my whiteboard data-science interviews.


Linear Support Vector Machine from the SVM Wikipedia page


An intimate understanding of the linear classifier is necessary to understand how deep learning systems work.  The neurons inside a multi-layer neural network are little linear classifiers, and while the final decision boundary is non-linear, you should understand the underlying primitives very well. Loosely speaking, you can think of the linear classifier as a simple spring system and a more complex classifiers as a higher-order assembly of springs.


Also, there are going to be scenarios in your life as a data-scientist where a linear classifier should be the first machine learning algorithm you try. So don't be afraid to use some pen and paper, get into that hinge loss, and master the fundamentals.

Further reading: Google's Research Director talks about Machine Learning. Peter Norvig's Reddit AMA on YouTube from 2010.
Further reading: A demo for playing with linear classifiers in the browser. Linear classifier Javascript demo from Stanford's CS231n: Convolutional Neural Networks for Visual Recognition.
Further reading: My blog post: Deep Learning vs Machine Learning vs Pattern Recognition

Verbal: Talk about you vision (and join a community)

As you start acquiring knowledge of machine learning concepts, the best way forward is to speak up. Learn something, then teach a friend. As counterintuitive as it sounds, when it comes down to machine learning mastery, human-human interaction is key. This is why getting a ML-heavy Masters or PhD degree is ultimately the best bet for those adamant about becoming pioneers in the field. Daily conversations are necessary to strengthen your ideas.  See Raphael's "The School of Athens" for a depiction of what I think of as the ideal learning environment.  I'm sure half of those guys were thinking about computer vision.


An ideal ecosystem for collaboration and learning about computer vision


If you're not ready for a full-time graduate-level commitment to the field, consider a.) taking an advanced undergraduate course in vision/learning from your university, b.) a machine learning MOOC, or c.) taking part in a practical and application-focused online community/course focusing on computer vision.

During my 12-year academic stint, I made the observation that talking to your peers about computer vision and machine learning is more important that listening to teachers/supervisors/mentors.  Of course, there's much value in having a great teacher, but don't be surprised if you get 100x more face-to-face time with your friends compared to student-teacher interactions.  So if you take an online course like Coursera's Machine Learning MOOC, make sure to take it with friends.  Pause the video and discuss. Go to dinner and discuss. Write some code and discuss. Rinse, lather, repeat.

Coursera's Machine Learning MOOC taught by Andrew Ng


Another great opportunity is to follow Adrian Rosebrock's pyimagesearch.com blog, where he focuses on python and computer vision applications.  

Further reading: Old blog post: Why your vision lab needs a reading group

Homework assignment: First somebody on the street and teach them about machine learning.

Intuitive: Play with a real-time machine learning system

The third and final dimension is centered around intuition. Intuition is the ability to understand something immediately, without the need for conscious reasoning. The following guidelines are directed towards real-time object detection systems, but can also transfer over to other applications like learning-based attribution models for advertisements, high-frequency trading, as well as numerous tasks in robotics.

To gain some true insights about object detection, you should experience a real-time object detection system.  There's something unique about seeing a machine learning system run in real-time, right in front of you.  And when you get to control the input to the system, such as when using a webcam, you can learn a lot about how the algorithms work.  For example, seeing the classification score go down as you occlude the object of interest, and seeing the detection box go away when the object goes out of view is fundamental to building intuition about what works and what elements of a system need to improve.

I see countless students tweaking an algorithm, applying it to a static large-scale dataset, and then waiting for the precision-recall curve to be generated. I understand that this is the hard and scientific way of doing things, but unless you've already spent a few years making friends with every pixel, you're unlikely to make a lasting contribution this way. And it's not very exciting -- you'll probably fall asleep at your desk.

Using a real-time feedback loop (see illustration below), you can learn about the patterns which are intrinsically difficult to classify, as well what environmental variations (lights, clutter, motion) affect your system the most.  This is something which really cannot be done with a static dataset.  So go ahead, mine some intuition and play.
Visual Debugging: Designing the vision.ai real-time gesture-based controller in Fall 2013

Visual feedback is where our work at vision.ai truly stands out. Take a look at the following video, where we show a live example of training and playing with a detector based on vision.ai's VMX object recognition system.


NOTE: There a handful of other image recognition systems out there which you can turn into real-time vision systems, but be warned that optimization for real-time applications requires some non-trivial software engineering experience.  We've put a lot of care into our system so that the detection scores are analogous to a linear SVM scoring strategy. Making the output of a non-trivial learning algorithm backwards-compatible with a linear SVM isn't always easy, but in my opinion, well-worth the effort.

Extra Credit: See comments below for some free VMX by vision.ai beta software licenses so you can train some detectors using our visual feedback interface and gain your own machine vision intuition.

Conclusion

The three dimensions, namely mathematical, verbal, and intuitive provide different ways for advancing your knowledge of machine learning and computer vision systems.  So remember to love the linear classifier, talk to your friends, and use a real-time feedback loop when designing your machine learning system.