Tombone's Computer Vision Blog: classification

Showing posts with label classification. Show all posts

Friday, March 20, 2015

Deep Learning vs Machine Learning vs Pattern Recognition

Lets take a close look at three related terms (Deep Learning vs Machine Learning vs Pattern Recognition), and see how they relate to some of the hottest tech-themes in 2015 (namely Robotics and Artificial Intelligence). In our short journey through jargon, you should acquire a better understanding of how computer vision fits in, as well as gain an intuitive feel for how the machine learning zeitgeist has slowly evolved over time.

Fig 1. Putting a human inside a computer is not Artificial Intelligence

(Photo from WorkFusion Blog)

If you look around, you'll see no shortage of jobs at high-tech startups looking for machine learning experts. While only a fraction of them are looking for Deep Learning experts, I bet most of these startups can benefit from even the most elementary kind of data scientist. So how do you spot a future data-scientist? You learn how they think.

The three highly-related "learning" buzz words

“Pattern recognition,” “machine learning,” and “deep learning” represent three different schools of thought. Pattern recognition is the oldest (and as a term is quite outdated). Machine Learning is the most fundamental (one of the hottest areas for startups and research labs as of today, early 2015). And Deep Learning is the new, the big, the bleeding-edge -- we’re not even close to thinking about the post-deep-learning era. Just take a look at the following Google Trends graph. You'll see that a) Machine Learning is rising like a true champion, b) Pattern Recognition started as synonymous with Machine Learning, c) Pattern Recognition is dying, and d) Deep Learning is new and rising fast.

1. Pattern Recognition: The birth of smart programs

Pattern recognition was a term popular in the 70s and 80s. The emphasis was on getting a computer program to do something “smart” like recognize the character "3". And it really took a lot of cleverness and intuition to build such a program. Just think of "3" vs "B" and "3" vs "8". Back in the day, it didn’t really matter how you did it as long as there was no human-in-a-box pretending to be a machine. (See Figure 1) So if your algorithm would apply some filters to an image, localize some edges, and apply morphological operators, it was definitely of interest to the pattern recognition community. Optical Character Recognition grew out of this community and it is fair to call “Pattern Recognition” as the “Smart" Signal Processing of the 70s, 80s, and early 90s. Decision trees, heuristics, quadratic discriminant analysis, etc all came out of this era. Pattern Recognition become something CS folks did, and not EE folks. One of the most popular books from that time period is the ~~infamous~~ invaluable Duda & Hart "Pattern Classification" book and is still a great starting point for young researchers. But don't get too caught up in the vocabulary, it's a bit dated.

The character "3" partitioned into 16 sub-matrices. Custom rules, custom decisions, and custom "smart" programs used to be all the rage.

Quiz: The most popular Computer Vision conference is called CVPR and the PR stands for Pattern Recognition. Can you guess the year of the first CVPR conference?

2. Machine Learning: Smart programs can learn from examples

Sometime in the early 90s people started realizing that a more powerful way to build pattern recognition algorithms is to replace an expert (who probably knows way too much about pixels) with data (which can be mined from cheap laborers). So you collect a bunch of face images and non-face images, choose an algorithm, and wait for the computations to finish. This is the spirit of machine learning. "Machine Learning" emphasizes that the computer program (or machine) must do some work after it is given data. The Learning step is made explicit. And believe me, waiting 1 day for your computations to finish scales better than inviting your academic colleagues to your home institution to design some classification rules by hand.

"What is Machine Learning" from Dr Natalia Konstantinova's Blog. The most important part of this diagram are the "Gears" which suggests that crunching/working/computing is an important step in the ML pipeline.

As Machine Learning grew into a major research topic in the mid 2000s, computer scientists began applying these ideas to a wide array of problems. No longer was it only character recognition, cat vs. dog recognition, and other “recognize a pattern inside an array of pixels” problems. Researchers started applying Machine Learning to Robotics (reinforcement learning, manipulation, motion planning, grasping), to genome data, as well as to predict financial markets. Machine Learning was married with Graph Theory under the brand “Graphical Models,” every robotics expert had no choice but to become a Machine Learning Expert, and Machine Learning quickly became one of the most desired and versatile computing skills. However "Machine Learning" says nothing about the underlying algorithm. We've seen convex optimization, Kernel-based methods, Support Vector Machines, as well as Boosting have their winning days. Together with some custom manually engineered features, we had lots of recipes, lots of different schools of thought, and it wasn't entirely clear how a newcomer should select features and algorithms. But that was all about to change...

Further reading: To learn more about the kinds of features that were used in Computer Vision research see my blog post: From feature descriptors to deep learning: 20 years of computer vision.

3. Deep Learning: one architecture to rule them all

Fast forward to today and what we’re seeing is a large interest in something called Deep Learning. The most popular kinds of Deep Learning models, as they are using in large scale image recognition tasks, are known as Convolutional Neural Nets, or simply ConvNets.

ConvNet diagram from Torch Tutorial

Deep Learning emphasizes the kind of model you might want to use (e.g., a deep convolutional multi-layer neural network) and that you can use data fill in the missing parameters. But with deep-learning comes great responsibility. Because you are starting with a model of the world which has a high dimensionality, you really need a lot of data (big data) and a lot of crunching power (GPUs). Convolutions are used extensively in deep learning (especially computer vision applications), and the architectures are far from shallow.

If you're starting out with Deep Learning, simply brush up on some elementary Linear Algebra and start coding. I highly recommend Andrej Karpathy's Hacker's guide to Neural Networks. Implementing your own CPU-based backpropagation algorithm on a non-convolution based problem is a good place to start.

There are still lots of unknowns. The theory of why deep learning works is incomplete, and no single guide or book is better than true machine learning experience. There are lots of reasons why Deep Learning is gaining popularity, but Deep Learning is not going to take over the world. As long as you continue brushing up on your machine learning skills, your job is safe. But don't be afraid to chop these networks in half, slice 'n dice at will, and build software architectures that work in tandem with your learning algorithm. The Linux Kernel of tomorrow might run on Caffe (one of the most popular deep learning frameworks), but great products will always need great vision, domain expertise, market development, and most importantly: human creativity.

Other related buzz-words

Big-data is the philosophy of measuring all sorts of things, saving that data, and looking through it for information. For business, this big-data approach can give you actionable insights. In the context of learning algorithms, we’ve only started seeing the marriage of big-data and machine learning within the past few years. Cloud-computing, GPUs, DevOps, and PaaS providers have made large scale computing within reach of the researcher and ambitious "everyday" developer.

Artificial Intelligence is perhaps the oldest term, the most vague, and the one that was gone through the most ups and downs in the past 50 years. When somebody says they work on Artificial Intelligence, you are either going to want to laugh at them or take out a piece of paper and write down everything they say.

Further reading: My 2011 Blog post Computer Vision is Artificial Intelligence.

Conclusion

Machine Learning is here to stay. Don't think about it as Pattern Recognition vs Machine Learning vs Deep Learning, just realize that each term emphasizes something a little bit different. But the search continues. Go ahead and explore. Break something. We will continue building smarter software and our algorithms will continue to learn, but we've only begun to explore the kinds of architectures that can truly rule-them-all.

If you're interested in real-time vision applications of deep learning, namely those suitable for robotic and home automation applications, then you should check out what we've been building at vision.ai. Hopefully in a few days, I'll be able to say a little bit more. :-)

Until next time.

See discussion about this blog post on Hacker News.

Tuesday, August 18, 2009

exciting stuff at BAVM2009 #1: joint regularization

There were a couple of cool computer vision ideas that I was exposed to at BAVM2009. First, Trevor Darrell mentioned some cool work by Ariadna Quattoni on L1/L_inf regularization. The basic idea, which has also recently been used in other ICML 2009 works such as Han Liu and Mark Palatucci's Blockwise Coordinate Descent, is that you want to regularize across a bunch of problems. This is sometimes referred to as multi-task learning. Imagine solving two SVM optimization problems to find linear classifiers for detecting cars and bicycles in images. It is reasonable to expect that in high dimensional spaces these two classifiers will something in common. To provide more intuition, it might be the case that your feature set provides many irrelevant variables and when learning these classifiers independently much work is spent on removing these dumb variables. By doing some sort of joint regularization (or joint feature selection), you can share information across seemingly distinct classification problems.

In fact, when I was talking about my own CVPR08 work Daphne Koller suggested that this sort of regularization might work for my task of learning distance functions. However, I am currently exploiting the independence that I get from not doing any cross-problem regularization by solving the distance function learning problems independently. While regularization might be desirable, it couples problems and it might be difficult to solve hundreds of thousands of such problems jointly.

I will mention some other cool works in future posts.

Friday, June 19, 2009

A Shift of Focus: Relying on Prototypes versus Support Vectors

The goal of today's blog post is to outline an important difference between traditional categorization models in Psychology such as Prototype Models, and Support Vector Machine (SVM) based models.

When solving a SVM optimization problem in the dual (given a kernel function), the answer is represented as a set of weights associated with each of the data-centered kernels. In the Figure above, a SVM is used to learn a decision boundary between the blue class (desks) and the red class (chairs). The sparsity of such solutions means that only a small set of examples are used to define the class decision boundary. All points on the wrong side of the decision boundary and barely yet correctly classified points (within the margin) have non-zero weights. Many Machine Learning researchers get excited about the sparsity of such solutions because in theory, we only need to remember a small number of kernels for test time. However, the decision boundary is defined with respect to the problematic examples (misclassified and barely classified ones) and not the most typical examples. The most typical (and easy to recognize) examples are not even necessary to define the SVM decision boundary. Two data sets that have the same problematic examples, but significant differences in the "well-classified" examples might result in the same exact SVM decision boundary.

My problem with such boundary-based approaches is that by focusing only on the boundary between classes useful information is lost. Consider what happens when two points are correctly classified (and fall well beyond the margin on their correct side): the distance-to-decision-boundary is not a good measure of class membership. By failing to capture the "density" of data, the sparsity of such models can actually be a bad thing. As with discriminative methods, reasoning about the support vectors is useful for close-call classification decisions, but we lose fine-scale membership details (aka "density information") far from the decision surface.

In a single-prototype model (pictured above), a single prototype is used per class and distances-to-prototypes implicitly define the decision surface. The focus is on exactly the 'most confident' examples, which are the prototypes. Prototypes are created during training -- if we fit a Gaussian distribution to each class, the mean becomes the prototype. Notice that by focusing on Prototypes, we gain density information near the prototype at the cost of losing fine-details near the decision boundary. Single-Prototype models generally perform worse on forced-choice classification tasks when compared to their SVM-based discriminative counterparts; however, there are important regimes where too much emphasis on the decision boundary is a bad thing.

In other words, Prototype Methods are best and what they were designed to do in categorization, namely capture Typicality Effects (see Rosch). It would be interesting to come up with more applications where handing Typicality Effects and grading membership becomes more important than making close-call classification decision. I suspect that in many real-world information retrieval applications (where high precision is required and low recall tolerated) going beyond boundary-based techniques is the right thing to do.