Tuesday, November 22, 2005

pittsburgh airport wireless

I'm currently sitting at the Pittsburgh Airport Terminal browsing the web. I finished reading Cryptonomicon 3 minutes ago.

I finished my Machine Project on Latent Dirichlet Allocation yesterday. I also won (tied for 1st actually) a photography competition in my Appearance Modeling class. Congrats Jean-Francois for splitting the win with me! Coincidently, we are working together on a project for that class.

Today in my Machine Learning class, Prof Mitchell talked about dimensionality reduction techniques. It still feels a bit weird when people call PCA and SVD an "unsupervised dimensionality reduction" technique. People should really make sure that they understand the singular value decomposition, not only as "some recondite matrix factorization", but as "the linear operator factorization."

I was thinking about using applying SVD to Latent Dirichlet Allocation features (the output of my machine lerning project). As an unsupervised machine learning technique, LDA automatically learns hidden topics. The output of LDA is the probability of a word belonging to a particular topics P(w|z). Variational inference could be used to find information about an unseed document. Given this novel document, we can determine P(z|d). In other words, LDA maps a document of arbitrary length to a vector on the K-simplex (the K-topic mixture proportions are a multinomial random variable).


By constructing a large data matrix with each row being the multinomial mixture weights of a particular document (this matrix woudl have as many columns as topics) and performing SVD, we would hope to be able to create a new set of K' uncorrelated topics. This is just like diagonalizing a matrix.

It would also be interesting to run LDA with a different number of topics (L={30,40,50,60,...,300}) and deterime the rank (or just look at the spectrum) of the data matrix. The rank would tell us how many 'truly independent' categories are present in our corpus. Here a category would be defined as a linear combination of latent topics, and it would be interesting to see how these 'orthonormal' categories obtained from SVD would relate to the original newsgroup categories.