Friday, December 02, 2005

Navigating two worlds

Today I want to talk about synonyms and photometric invariance, comparing and contrasting the world of language with the visual world. My primary objective is to build a vision system that can learn to recognize objects in an unsupervised or semi-supervised fashion. I should stress that I'm far more interested in machine learning these days than I ever was before.

I have recently been introduced to unsupervised techniques from the field of statistical language modeling, and the following discussion revolves around the differences between the man-made world of text and the natural world of images.

Here, when I mention text I am referring to a legitimate configuration of English words. It is important to realize that in the world of language, there are two very different uses for words. In one case, words are mere vessels for transporting a high-level concept: there is nothing special about a particular choice of words, and many different configurations of words map to the same high-level semantic interpretation. A poetic use of language, on the other hand, strives to convey a high-level meaning through a carefully selected configuration of words.

In the visual domain we can also treat images as having many purposes. In the first case an image captures 'a' configuration of the world, and in the second case it captures 'the' configuration of the world. Allow me to explain. 'A' configuration of the world represents a possible configuration of objects where there is nothing particularly interesting about that specific configuration. For example, when depicting 'a' configuration of hikers camping on a mountaintop, the color of the tent doesn't alter the high-level fact that there is a tent, and the presence of snow on the mountain doesn't alter the fact that there is a mountain. On the other hand, when using images to capture 'the' configuration of the world, the color of the tent and the presence of the snow do matter. 'The' configuration would represent some high-level concept such as 'Julie and Tim camping on Mount Sefton in March.' Neither the 'a' nor the 'the' configuration stated anything about the sky (cloudy, sunny, sunset, sunrise), so both images could contain different skies while remaining true to their 'a' or 'the' purposes.

Although understanding the world of English text is easier than understanding the visual world, there are many similarities. Statistical co-occurrence is the key idea behind unsupervised topic discovery and parts-of-speech tagging, and it is also a necessary notion when trying to understand images. When local structures (letters, words, image patches) co-occur, we can use induction to explain this phenomenon. In some sense, understanding data is not much more than compression of the data. Here I don't mean compression as a way of reducing data set size so that the initial data set can be reconstructed in some L2-norm sense. I'm referring to a compression (a projection onto a lower-dimensional space) such that the reconstruction preserves the high-level {semantic, visual} attributes that are relevant. Consider the 'the' configuration of the hikers mentioned in the paragraph above. A good compression would preserve {the identities of the hikers, the presence of snow, the color of the tent} but would discard anything about the sky if it was not relevant.
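To make the co-occurrence-as-compression idea concrete, here is a minimal sketch in the spirit of latent semantic analysis. The toy corpus, the window-free word-by-document counts, the rank-2 truncated SVD, and the cosine helper are all my illustrative assumptions, not anything from this post:

```python
# Toy sketch: co-occurrence statistics + a low-rank projection (LSA-style).
# The corpus and the choice of rank k=2 are illustrative assumptions.
import numpy as np

docs = [
    "tent snow mountain hiker",
    "hiker tent mountain snow",
    "words text language meaning",
    "language words meaning text",
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Word-by-document count matrix: rows co-occur when words share documents.
counts = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        counts[index[w], j] += 1

# Project onto a k-dimensional subspace via truncated SVD.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
emb = U[:, :k] * s[:k]  # low-dimensional word representations

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words that co-occur ('tent', 'snow') get nearly identical embeddings;
# words that never co-occur ('tent', 'language') end up near-orthogonal.
print(cosine(emb[index["tent"]], emb[index["snow"]]))
print(cosine(emb[index["tent"]], emb[index["language"]]))
```

The projection discards which document a word came from while preserving the co-occurrence structure, which is exactly the sense of compression described above.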

Within a few days I'll be posting my LDA results on unsupervised topic discovery in text. I will then quickly delineate some of the new directions I've been taking with respect to unsupervised segmentation of text (which was artificially concatenated so as to eliminate the spaces) and how these results can be applied to the visual domain, where object boundaries are what we want to find.
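To show what the segmentation problem looks like, here is a dictionary-based dynamic-programming segmenter. This is not the unsupervised method alluded to above (it assumes a known word list, which an unsupervised approach would have to discover from co-occurrence statistics); the dictionary and input string are my own illustrative choices:

```python
# Toy segmenter: recover word boundaries from space-stripped text via
# dynamic programming. WORDS is a stand-in assumption; the post's goal
# is to discover such units without supervision.
WORDS = {"the", "hikers", "camp", "on", "mount", "sefton"}

def segment(s):
    # best[i] holds a segmentation of s[:i] into dictionary words, or None.
    best = [None] * (len(s) + 1)
    best[0] = []
    for i in range(1, len(s) + 1):
        for j in range(i):
            if best[j] is not None and s[j:i] in WORDS:
                best[i] = best[j] + [s[j:i]]
                break
    return best[len(s)]

print(segment("thehikerscamponmountsefton"))
# → ['the', 'hikers', 'camp', 'on', 'mount', 'sefton']
```

The analogy to vision is direct: image patches play the role of substrings, and object boundaries play the role of the spaces we are trying to recover.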