Deep Learning, Computer Vision, and the algorithms that are shaping the future of Artificial Intelligence.
Wednesday, December 24, 2008
Newton's Method Fractal Yet Again
In the future I plan on synchronizing the music with the fractals. Here is a cool screenshot from the movie when the background becomes white.
Friday, December 05, 2008
Using Computer Vision to Solve Jigsaw Puzzles
In this image I've shown puzzle piece A which is fixed and in red, and puzzle piece B as well as some likely transformations that when applied to B snap it to piece A. If I have more time over Xmas break and I get to finish the final puzzle -- I'll be sure to post the details.
Tuesday, November 18, 2008
Algorithmic Simplicity + Data > Algorithmic Complexity
We are now living in a Machine Learning generation where hand tweaked parameters are looked down upon and if you want to publish an object recognition paper you'll need to test your algorithm on a standard dataset containing hundreds of images spanning many different types of objects. There is still a lot of excitement about Machine Learning in the air and new approaches are constantly being introduced as the new 'state-of-the-art' on canonical datasets. The problem with this mentality is that researchers are introducing a lot of complicated machinery and it is often unclear whether these new techniques will stand the test of time.
Peter Norvig -- now at Google -- advocates an alternative view. Rather than designing more advanced machinery to work with a measly 20,000 or so training images for an object recognition task, we shouldn't be too eager to draw conclusions when dealing with such paltry training sets. In a recent Norvig video lecture I watched he showed some interesting results where the algorithms that obtained the best performance on a small dataset no longer did the best when the size of the training set was increased by an order of magnitude. In some cases, when fixing the test set, the simplest algorithms provided with an order of magnitude more training data outperformed the most advanced 'state-of-the-art.' Also, the mediocre algorithms in the small training size regime often outperformed their more complicated counterparts once more data was utilized.
The next generation of researchers will inevitably be using much more training data than we are at the moment, so if we want our scientific contributions to pass the test of time, we have to focus on designing simple yet principled algorithms. Focus on simplicity. Consider a particular recognition task, namely car recognition. Without any training data we are back in the 1960/1970s generation where we have to hard-code rules about what it means to be a car in order for an algorithm to work on a novel image. With a small amount of labeled training data, we can now learn the parameters of a general parts-based car detector -- we can even learn the appearance of such parts. But what can we do with millions of images of cars? Do we even need much more than a large scale nearest neighbor lookup?
As Rodney Brooks once said, "The world is its own best representation," and perhaps we should follow Google's mentality and simply equip our ideas with more, more, more training data.
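To make the "large scale nearest neighbor lookup" idea concrete, here is a minimal brute-force sketch in Python. The function name and the toy 2D points (standing in for image descriptors) are my own assumptions for illustration; a real system with millions of images would need an approximate index rather than a full distance matrix.

```python
import numpy as np

def nearest_neighbor_labels(train_x, train_y, query_x):
    """Label each query with the label of its closest training point (1-NN)."""
    # Squared Euclidean distances via ||a-b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d2 = (
        np.sum(query_x ** 2, axis=1)[:, None]
        - 2.0 * query_x @ train_x.T
        + np.sum(train_x ** 2, axis=1)[None, :]
    )
    return train_y[np.argmin(d2, axis=1)]

# Toy data: two well-separated clusters standing in for image descriptors.
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
train_y = np.array([0, 0, 1, 1])
query_x = np.array([[0.2, 0.1], [4.8, 5.1]])
predicted = nearest_neighbor_labels(train_x, train_y, query_x)
```

There are no parameters to fit here: adding more labeled data simply means adding more rows to `train_x`.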
Tuesday, November 04, 2008
Computer Vision as immature?
The field of computer vision can be characterized as immature and diverse ... Consequently there is no standard formulation of "the computer vision problem." ... no standard formulation of how computer vision problems should be solved.
I agree that there is no elegant equation akin to F=ma or Schrodinger's Wave Equation that is magically supposed to explain how meaning is supposed to be attributed to images. While this might seem like a weak point, especially to the mathematically inclined always seeking to generalize and abstract away, I am skeptical of Computer Vision ever being grounded in such an all-encompassing mathematical theory.
Being a discipline centered on perception and reasoning, there is something about Computer Vision that will make it forever escape formalization. State of the art computer vision systems that operate on images can return many different types of information. Some systems return bounding boxes of all object instances from a single category, some systems break up the image into regions (segmentation) and say nothing about object classes/categories, and other systems assign a single object-level category to the entire image without performing any localization/segmentation. Aside from objects, some systems (See Hoiem et al. and Saxena et al.) return a geometric 3D layout of the scene. While it seems that humans can do extremely well at all these tasks, it makes sense that different robotic agents interacting with the real world should perceive the world differently to accomplish their own varying tasks. Think of biological vision -- do we see the same world as dogs? Is there an objective observer-independent reality that we are supposed to see? To me, perception is very personal, and while my hardware (brain) might appear similar to another human's I'm not convinced that we see/perceive/understand the world the same way.
I can imagine ~40 years ago researchers/scientists trying to come up with an abstract theory of computation that would allow one to run arbitrary computer programs. What we have today is myriad operating systems and programming languages suited for different crowds and different applications. While the humanoid robot in our living room is nowhere to be found, I believe if we wait until that day and inspect its internal workings we will not see a beautiful rigorous mathematical theory. We will see AI/mechanical components developed by different research groups and integrated by other researchers -- the fruits of a long engineering effort. These bots will be always learning, always updating, always getting updates, and always getting replaced by newer and better ones.
Linear Support Vector Machine (SVM) in the Primal
Quite often when SVMs are taught in a Machine Learning course, the dual and kernelization are jumped into very quickly. While the formulation looks nice and elegant, I'm not sure how many students can come home after such a lecture and implement an SVM in a language such as MATLAB on their own.
I have to admit that I never really obtained a good grasp of Support Vector Machines until I sat through John Lafferty's lectures in 10-702: Statistical Machine Learning which demystified them. The main idea was that an SVM is just like logistic regression but with a different loss function -- the hinge loss function.
A very good article written on SVMs, and how they can be efficiently tackled in the primal (which is super-easy to understand) is Olivier Chapelle's Training a Support Vector Machine in the Primal. Chapelle is a big proponent of primal optimization and he has some arguments on why primal approximations are better than dual approximations. On this webpage one can also find MATLAB code for SVMs which is very fast in practice since it uses a second order optimization technique. In order to use such a second order technique, the squared hinge loss is used (see the article for why). I have used this code many times in the past even for linear binary classification problems with over 6 million data points embedded in 7 dimensions.
In fact, this is the code that I have used in the past for distance function learning. So the next time you want to throw something simple and fast at a binary classification problem, check out Chapelle's code.
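For readers who want to see what "second order optimization on the squared hinge loss" might look like, here is a small Python sketch. It is not Chapelle's code: the function name is hypothetical, I regularize the bias term for simplicity, and I take full Newton steps without a line search. The point is only that once the set of margin-violating points is fixed, the objective is quadratic, so each Newton step reduces to a regularized least-squares solve.

```python
import numpy as np

def train_primal_svm(x, y, lam=1.0, max_iter=50):
    """Linear SVM in the primal with squared hinge loss, via Newton steps.

    Minimizes 0.5 * lam * ||w||^2 + sum_i max(0, 1 - y_i * (w . x_i))^2,
    with labels y_i in {-1, +1}.
    """
    # Append a constant feature so the last weight acts as a bias.
    # Simplification: the bias is regularized along with the other weights.
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    w = np.zeros(xb.shape[1])
    active = np.ones(x.shape[0], dtype=bool)  # at w = 0 every margin is violated
    for _ in range(max_iter):
        xs, ys = xb[active], y[active]
        # On a fixed active set the objective is quadratic, so the Newton
        # step lands on the solution of a regularized least-squares problem.
        hessian = lam * np.eye(xb.shape[1]) + 2.0 * xs.T @ xs
        w = np.linalg.solve(hessian, 2.0 * xs.T @ ys)
        new_active = y * (xb @ w) < 1.0
        if np.array_equal(new_active, active):
            break  # active set is stable, so w minimizes the full objective
        active = new_active
    return w

# Toy linearly separable problem.
x = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w = train_primal_svm(x, y)
predictions = np.sign(np.hstack([x, np.ones((len(x), 1))]) @ w)
```

The squared hinge matters here because it makes the loss differentiable, which is what lets the Hessian be written down at all.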
Thursday, October 30, 2008
creating visualizations via Google Earth and Matlab
In addition, if you have lots of data you can use the level of detail capabilities of KML to cut your data up into little chunks. It's just like an octree from your computer graphics class. Pretty easy to code up. All your data can be stored on a server so there is no need to have anybody download it. It's possible to make a webserver emit the .kml files properly and not as ascii so that Google Earth can directly open it.
Once I *finalize* some kml files I'll show simple examples of the Level of Detail visualizations.
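In the meantime, here is a rough Python sketch of the chunking idea. It uses a quadtree over latitude/longitude (the 2D cousin of the octree) and the KML `Region`, `Lod`, and `NetworkLink` elements; the function names, tile names, and pixel threshold are assumptions of mine, not part of any finished tool.

```python
def split_tile(north, south, east, west):
    """Split a lat/lon box into four equal child boxes (one quadtree step)."""
    mid_lat = (north + south) / 2.0
    mid_lon = (east + west) / 2.0
    return [
        (north, mid_lat, mid_lon, west),   # north-west
        (north, mid_lat, east, mid_lon),   # north-east
        (mid_lat, south, mid_lon, west),   # south-west
        (mid_lat, south, east, mid_lon),   # south-east
    ]

def region_kml(north, south, east, west, min_lod_pixels=128):
    """KML <Region> that becomes active once the box covers enough screen pixels."""
    return (
        "<Region>"
        f"<LatLonAltBox><north>{north}</north><south>{south}</south>"
        f"<east>{east}</east><west>{west}</west></LatLonAltBox>"
        f"<Lod><minLodPixels>{min_lod_pixels}</minLodPixels>"
        "<maxLodPixels>-1</maxLodPixels></Lod>"
        "</Region>"
    )

def tile_kml(name, box, child_names):
    """One KML file: its own region plus region-gated links to its child tiles."""
    links = []
    for child_name, child_box in zip(child_names, split_tile(*box)):
        links.append(
            "<NetworkLink>"
            + region_kml(*child_box)
            + f"<Link><href>{child_name}.kml</href>"
            + "<viewRefreshMode>onRegion</viewRefreshMode></Link>"
            + "</NetworkLink>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        + '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
        + f"<name>{name}</name>"
        + region_kml(*box)
        + "".join(links)
        + "</Document></kml>"
    )

# Hypothetical root tile and child file names.
doc = tile_kml("root", (41.0, 40.0, -79.0, -80.0),
               ["root_0", "root_1", "root_2", "root_3"])
```

Each child file is only fetched when its region is large enough on screen, so the viewer never downloads the whole dataset. On the server side, serving these files with the MIME type `application/vnd.google-earth.kml+xml` is what lets Google Earth open them directly.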
Sunday, September 21, 2008
6.870 Object Recognition and Scene Understanding
Why should anybody care what courses new faculty are offering, especially if they are being taught at another academic institution? The answer is simple. The new rising stars (a.k.a the new faculty) teach graduate-level courses that reflect ideas which these professors are truly passionate about. Besides the few initial semesters, when new faculty sometimes have to teach introductory level courses, these special topic courses (most often they are grad-level courses) reflect what has been going on in their heads for the past 10 years. Such courses reflect the past decade of research interests (pursued by the new professor) and the material is often presented in such a way that the students will get inspired and have the best opportunity to one day surpass the professor. I'm a big advocate of letting faculty teach their own courses -- of course introductory level undergraduate courses still have to be taught somehow...
A new professor's publication list is a depiction of what kind of research was actually pursued; however, the material comprising a special topic course presents a theme -- a conceptual layout -- which is sometimes a better predictor of where a professor's ideas (and inadvertently the community's) are actually going long-term. If you want to see where Computer Vision is going, just see what new faculty are teaching.
On the course homepage,
Thanks to my advisor, Alyosha Efros, for pointing out this new course.
On another note, I'm back at CMU from a Google summer internship where I was working with Thomas Leung on computer vision related problems.
Sunday, August 10, 2008
What is segmentation? What is image segmentation?
Segmentation is a term that often pops up in technical fields, such as Computer Vision. I have attempted to write a short article on Knol about Image Segmentation and how it pertains to Computer Vision. Deciding to abstain from discussing specific algorithms -- which might be of interest to graduate students and not the population as a whole -- I instead target the high-level question, "Why segment images?" The answer, according to me, is that image segmentation (and any other image processing task) should be performed solely to assist object recognition and image understanding.
Wednesday, July 30, 2008
The Duality of Perception and Action
This is a rather remarkable excerpt from "Advanced Image Understanding and Autonomous Systems," by David Vernon from Department of Computer Science in Trinity College Dublin, Ireland.
Thursday, July 24, 2008
More Newton's Method Fractals on Youtube
Friday, July 11, 2008
Learning Per-Exemplar Distance Functions == Learning Anisotropic Per-Exemplar Kernels for Non-Parametric Density Estimation
Non-parametric approaches make very weak assumptions about the underlying distribution -- however, the number of parameters in a non-parametric density estimator scales with the number of data points. Generally estimation proceeds as follows: a.) store all the data, b.) use something like a Parzen density estimate where a tiny Gaussian kernel is placed around each data point.
The theoretical strength of the non-parametric approach is that *in theory* we can approximate any underlying distribution. In reality, we would need a crazy amount of data to approximate any underlying distribution -- the pragmatic answer to "why be non-parametric?" is that there is no real learning going on. In a non-parametric approach, we don't have a parameter estimation stage. While estimating the parameters of a single Gaussian is quite easy, consider a simple Gaussian mixture model -- the parameter estimation (generally done via EM) is not so simple.
With the availability of inexpensive memory, data-driven approaches have become popular in computer vision. Does this mean that by using more data we can simply bypass parameter estimation?
An alternative is to combine the best of both worlds. Use lots of data, make very few assumptions about the underlying density, and still allow learning to improve density estimates. Usually a single isotropic kernel is used in non-parametric density estimation -- if we really have no knowledge this might be the only thing possible. But maybe something better than an isotropic kernel can be used, and perhaps we can use learning to find out the shape of the kernel.
The local per-exemplar distance function learning algorithm that I presented in my CVPR 2008 paper can be thought of as such an anisotropic per-data-point kernel estimator.
I put some MATLAB distance function learning code up on the project web page. Check it out!
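To illustrate the difference between an isotropic kernel and a per-exemplar anisotropic one, here is a small Python sketch of a Parzen estimate in which every exemplar carries its own axis-aligned Gaussian. This is not the CVPR 2008 algorithm: the function name is hypothetical and the bandwidths below are set by hand, whereas in a learned setting they would come out of training.

```python
import numpy as np

def parzen_density(query, exemplars, bandwidths):
    """Parzen estimate with one axis-aligned Gaussian kernel per exemplar.

    bandwidths has shape (n_exemplars, n_dims): row i holds the per-dimension
    standard deviations of the kernel centered on exemplar i.
    """
    diff = (query[None, :] - exemplars) / bandwidths
    # Log of each kernel's normalizing constant.
    log_norm = -0.5 * exemplars.shape[1] * np.log(2.0 * np.pi) - np.sum(
        np.log(bandwidths), axis=1
    )
    kernel_values = np.exp(log_norm - 0.5 * np.sum(diff ** 2, axis=1))
    return kernel_values.mean()

exemplars = np.array([[0.0, 0.0], [1.0, 0.0]])
query = np.array([3.0, 0.0])

# Same small bandwidth in every direction for every exemplar.
isotropic = parzen_density(query, exemplars, np.full((2, 2), 0.5))

# Hand-picked per-exemplar shapes: wide along x, narrow along y.
anisotropic = parzen_density(query, exemplars, np.array([[2.0, 0.1], [2.0, 0.1]]))
```

With the stretched kernels, the query far along the x axis receives much more density than under the isotropic estimate, which is the kind of shape information that learning could supply.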
Sunday, June 22, 2008
Going to Anchorage
On another note, yesterday I spent much of the day in beautiful San Francisco. I definitely need to make a few more visits there.
Thursday, June 05, 2008
computer vision summer internship at google
Here is an idea: imagine making sense of the billions of objects embedded in images contained in google street-view image database. Google is already blurring faces in these images -- which means they are running vision algorithms on this dataset -- but are google researchers finding makes/models of cars, reading street signs, analyzing building facades to see which homes are victorian/ranch/etc, aligning visual information with google maps, etc?
Google street view is an excellent portal from machine to the world. If there is ever any hope of visual recognition happening on a robot, then it will have to happen at Google first. First using immense computational power. If that works, why not outsource visual recognition capabilities to a company like Google? Imagine a little computer onboard your favorite humanoid robot that is actually communicating via some standard recognition API with google's servers. What the robot sees is sent over to Google for analysis -- then 'image understanding' data is propagated back. I imagine such a service could be set up, and for a fairly cheap price.
Wednesday, June 04, 2008
Shimon Edelman: Constraints on the nature of the neural representation of the visual world
The problem is that overfitting to what is currently *hot* at CVPR isn't very productive if you want to solve big problems. Philosophers, psychologists, roboticists, and cognitive neuroscientists have a lot to say about vision and offer plenty of ideas as to what they expect to see in a successful vision system. To a CS graduate student, something like "the problem of computer vision" might seem like a rather grand goal; however, these other scientists (from different fields) suggest that it is unlikely that a pure CS approach will get the glory.
Some concepts that are brought up in this paper are the following: ontological strategy, context, inherent ambiguities in segmentation, ineffability of the visual world, multidimensional similarity space. I think looking at vision from a philosophical point of view is not only enlightening, but suggests that what we should be after is more than just solving the problem of computer vision. What does it mean to solve the problem of computer vision after all? What we should be after is a theory of intelligence -- a theory of mind -- and strive to build truly intelligent machines.
Tuesday, May 20, 2008
dude, where's my image?
Anyways, you can just read his abstract and browse his results if you are interested in the kind of computer vision research that uses millions of images. The basic idea is to predict the location of an image using only information embedded inside the image (and a training set of over 6 million geo-tagged Flickr images.)
Saturday, May 17, 2008
what is recognition?
From my understanding, "category" == "class" and thus categorization and classification are the same thing! It is correct to say that when we categorize, we affix a label to some entity. But these labels do refer to categories, or classes. One can attribute the popularity of the term 'classification' to the field of machine learning. Categorization is a term that was more heavily used in psychology and has only recently been popping up in computer vision papers.
Because I see classification and categorization as the same thing, I don't agree that only one can be hierarchical.
Regarding the term recognition, the answer is a bit more complicated. In the field of computer vision, when one says that they are interested in recognition they are usually interested in recognizing novel instances from some predefined list of classes. To stress the interest in discrimination between a large number of object classes, vision researchers have recently begun using terms such as "a visual categorization system" or they talk about "object class recognition."
In all places that I have seen this term pop up, "identification" refers to specific instances. A face identification system might be designed to find faces of George Bush and might work on top of a face-class recognition system. The problem is that early work in computer vision was usually concerned with a fixed number of objects and the goal was to find those exact object instances inside an image -- and this was referred to as simply "recognition." Nowadays, we often use the term "recognition" to refer to category-level recognition and not specific objects.
In conclusion, recognition is a very general term that has been applied to both category-level recognition (dog vs. cat vs. car vs. person) and recognition of specific object instances (this particular blue ball vs. this particular face). To be more precise, one can use the terms "category-level recognition" and "identification."
This post has been written in response to Vidit Jain's blog post titled "Etymology of common learning-related words such as recognize."
Wednesday, April 23, 2008
newton's method fractal
When people make fractal videos (check them out on youtube), they are usually zooming into a fixed fractal. I have generated a fractal where the axis is fixed and the equation is changing. Check it out!
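Here is a minimal Python sketch of the idea: the grid (the axis) stays fixed while the polynomial changes from frame to frame. The function name and the one-parameter family z^3 + t*z - 1 are assumptions for illustration, not the equation used in the video.

```python
import numpy as np

def newton_fractal(coeffs, extent=2.0, size=200, steps=30):
    """Run Newton's method for a polynomial from every point of a fixed grid.

    Returns the complex grid after `steps` iterations; coloring each pixel by
    the root it ended up near gives the fractal image.
    """
    poly = np.poly1d(coeffs)
    dpoly = poly.deriv()
    axis = np.linspace(-extent, extent, size)
    z = axis[None, :] + 1j * axis[:, None]
    for _ in range(steps):
        dz = dpoly(z)
        dz[dz == 0] = 1e-12  # avoid division by zero at critical points
        z = z - poly(z) / dz
    return z

# Fixed axes, changing equation: one frame per value of t.
frames = [newton_fractal([1.0, 0.0, t, -1.0], size=100)
          for t in np.linspace(0.0, 1.0, 5)]
```

To turn a frame into an image, assign each pixel the index of the nearest entry of `np.roots(coeffs)` and map those indices to colors.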
Tuesday, April 08, 2008
Recognition by Association via Learning Per-exemplar Distances
Abstract:
We pose the recognition problem as data association. In this setting, a novel object is explained solely in terms of a small set of exemplar objects to which it is visually similar. Inspired by the work of Frome et al., we learn separate distance functions for each exemplar; however, our distances are interpretable on an absolute scale and can be thresholded to detect the presence of an object. Our exemplars are represented as image regions and the learned distances capture the relative importance of shape, color, texture, and position features for that region. We use the distance functions to detect and segment objects in novel images by associating the bottom-up segments obtained from multiple image segmentations with the exemplar regions. We evaluate the detection and segmentation performance of our algorithm on real-world outdoor scenes from the LabelMe dataset and also show some promising qualitative image parsing results.
http://www.cs.cmu.edu/~tmalisie/projects/cvpr08/
Thursday, April 03, 2008
Vocabulary Lesson: Transductive Learning
Induction, as opposed to deduction, is a form of reasoning that makes generalizations based on individual instances. It is important to note that induction isn't the kind of reasoning that predicate calculus or any other logic system was meant to handle. The conclusions produced from induction might have a high probability of being true but are never as certain as the inputs. The generalizations obtained from induction can be propagated onto newly observed inputs. One can think of a generalization obtained from induction as a function -- an abstract entity that can always map inputs to outputs.
The Merriam-Webster definition of Transduction states that it is: the transfer of genetic material from one microorganism to another by a viral agent (as a bacteriophage). While this definition has its roots in one particular branch of science, the crucial component of this definition is still present. Transduction is the transfer of something from entity A to entity B.
The Machine Learning definition of Transduction states that it is reasoning from observed inputs to specific test inputs. The key difference between induction and transduction is that induction refers to learning a function that can be applied to any novel inputs, while transduction is only concerned with transferring some property onto a specific set of test inputs.
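The distinction can be sketched in a few lines of Python. The inductive learner below returns a function that can be applied to any future input, while the transductive one only produces labels for the specific test points it was handed. Both learners (a nearest-centroid classifier and a simple label propagation scheme over a Gaussian affinity graph) and all names are my own choices for illustration, not the method from any particular paper.

```python
import numpy as np

def induce_centroid_classifier(train_x, train_y):
    """Induction: return a function that can label any future input."""
    classes = np.unique(train_y)
    centroids = np.array([train_x[train_y == c].mean(axis=0) for c in classes])

    def classify(x):
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return classes[np.argmin(d2, axis=1)]

    return classify

def transduce_labels(train_x, train_y, test_x, sigma=1.0, iters=100):
    """Transduction: propagate labels over a graph onto these test points only."""
    classes = np.unique(train_y)
    all_x = np.vstack([train_x, test_x])
    d2 = ((all_x[:, None, :] - all_x[None, :, :]) ** 2).sum(axis=2)
    affinity = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(affinity, 0.0)
    transition = affinity / affinity.sum(axis=1, keepdims=True)
    n_train = len(train_x)
    scores = np.zeros((len(all_x), len(classes)))
    clamp = (train_y[:, None] == classes[None, :]).astype(float)
    for _ in range(iters):
        scores[:n_train] = clamp          # keep the known labels fixed
        scores = transition @ scores      # spread label mass to neighbors
    return classes[np.argmax(scores[n_train:], axis=1)]

train_x = np.array([[0.0, 0.0], [4.0, 0.0]])
train_y = np.array([0, 1])
test_x = np.array([[1.0, 0.0], [3.0, 0.0], [0.5, 0.2]])

classifier = induce_centroid_classifier(train_x, train_y)
inductive_labels = classifier(test_x)      # the function is reusable later
transductive_labels = transduce_labels(train_x, train_y, test_x)
```

Note that `transduce_labels` has nothing to hand back for a point that arrives later; it would have to be rerun with the new point included in the graph.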
Rather than paraphrasing Wikipedia, the interested reader should do some follow-up research of their own into the merits of Transductive Learning.
To conclude, a WILLOW Research Team member -- Olivier Duchenne -- gave a talk about their CVPR 2008 work on applying Transductive Learning to the problem of image segmentation. This was my first exposure to the concepts of transductive learning and it is always a good thing to learn new things.
Monday, March 31, 2008
keyword analysis
33.87% ndseg
4.84% ipod turn on
3.23% latent dirichlet allocation
3.23% bmvc 2007
3.23% ipod won't turn on
3.23% logistic normal latent dirichlet allocation
1.61% burton clash
1.61% ndseg fellowship 2007
1.61% park city utah blog
1.61% ipod turning on
1.61% 2008 ndseg winners
1.61% my dream car
1.61% ndseg thegradcafe
1.61% ndseg fellowship offers 2008
1.61% cvpr 2007
1.61% ndseg, heard back
1.61% ndseg anyone
1.61% computer vision grad school
1.61% ndseg forum
1.61% ipod display support url
1.61% nsf graduate fellowship hear back
1.61% my ipod wont turn on
1.61% latent dirichlet allocation gibbs sampling
1.61% jogging in pittsburgh, squirrel hill
1.61% ndseg 2008 winners
1.61% eye inverse optics
1.61% multiple segmentations
1.61% nsf graduate fellowship heard yet?
1.61% burton clash guitar
1.61% nsf graduate research fellowships
1.61% thegradcafe ndseg
1.61% nsf grf march
1.61% my first paper on cvpr conference
1.61% ndseg fellowship
1.61% burton clash 2005
1.61% ndseg anyone heard
Wednesday, March 19, 2008
Understanding the past
By reading about past accomplishments and former ideologies in a particular field, one is essentially communicating with the ideas of the past. While many scholarly articles -- in a field such as Computer Vision -- are mostly devoted to algorithmic details and experimental evaluations, it isn't too difficult to find manuscripts which reveal the philosophic underpinnings of the proposed research. It is even possible to find papers which are entirely devoted to understanding the philosophical motivations of a past generation of research.
A prime example of interaction with the past is the paper "Object Recognition in the Geometric Era: A Retrospective," by Joseph L. Mundy from Brown University. Such a compilation of ideas -- perhaps even a mini-summa -- is quite accessible to any researcher in the field of Computer Vision. Avoiding the specific details of any algorithm developed in the so-called Geometric Era of Computer Vision, this text is both entertaining and highly educational. By reading such texts one is effectively communicating (albeit one-way) with a larger scientific community of the past.
To conclude, I would like to point out that neither do I agree with some of the past paradigms of Computer Vision, nor am I a die-hard proponent of the modern statistical machine learning school of thought. However, to explore new territories what better way to scope the world around you than by standing on the shoulders of giants? We should be aware of what has been done in the past, and sometimes de-emphasize algorithmic minutiae in order to understand the philosophical motivations behind former paradigms.
Wednesday, February 20, 2008
On Geometry and Computer Science
How about the term 'Computer Science'? Most of us probably still think about computer programming when we think about computer science. I believe that one day Computer Science will encompass so much of our daily lives that we will forget about the origins of this term. Dijkstra once said, "Computer Science is no more about computers than astronomy is about telescopes." I have to agree with him in the sense that Computer Science is a mental framework for solving problems -- it doesn't necessarily require computers.
How about 'Computer Vision'? Being a much younger discipline than Computer Science, we will have to wait and see what happens to this term. I've argued in earlier posts that it will become clear in the future that to solve the problem of Computer Vision, the field will inevitably need to become more concerned with intelligence, learning, and metaphysics and less about visual attributes and image processing. Maybe there will be no term Computer Vision in the future and the field of Machine Learning will take the glory. Or perhaps the term will stick but become so commonplace that we will forget how Computer Vision initially started out.
Saturday, February 02, 2008
Paris Adventure
I'm going to use a bunch of Google Maps features to track the places I've seen and visited. I will also be relying on Skype for communicating with friends and family overseas.
Au revoir,
Tomasz