Monday, November 28, 2005

Latent Topics and the Turing Test

Researchers in statistical language modeling employ the concept of a stoplist. A stoplist is a list of commonly occurring words such as "the", "of", and "are." When using a statistical technique based on the bag-of-words assumption (word exchangeability), these stopwords are discarded in the hope that the remaining words are the truly informative ones. Although suitable for classification and clustering tasks, such an approach falls short of modelling the syntax of the English language.
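
To make the stoplist idea concrete, here is a minimal sketch in Python. The tiny STOPLIST and the bag_of_words helper are made up for illustration (real stoplists are much longer, and real tokenizers do more than split on whitespace). It shows both what is kept and what is lost: two sentences with different word order end up with identical representations.

```python
from collections import Counter

# Hypothetical, tiny stoplist for illustration only.
STOPLIST = {"the", "of", "are", "a", "is", "on"}

def bag_of_words(text, stoplist=STOPLIST):
    """Lowercase, split on whitespace, drop stopwords, and count the rest."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in stoplist)

a = bag_of_words("the cat is on the mat")
b = bag_of_words("the mat is on the cat")

# Word order and the stopwords are gone, so the two sentences look identical.
print(a == b)
```

Once the stopwords and the ordering are thrown away, nothing in the counts says which noun was sitting on which.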

I believe that we should stop using stoplists. These 'meaningless' words are the glue that binds informative words together, and if we want to be able to perform tasks such as grammar checking and spell checking then we have to look beyond bag-of-words. By complementing models such as LDA with a latent syntactic label per word, we can attain partial exchangeability.

Latent Semantic Topic = A topic that denotes high-level information about the target sentence.
Latent Syntactic Topic = A topic that denotes the type of word (such as noun, verb, adjective).

Consider the sentence:
I read thick books.

This sentence is generated from the syntactic skeleton [noun,verb,adjective,noun].
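
As a rough sketch of what generating from a syntactic skeleton could look like, the Python below fills each slot of the skeleton with a word drawn from that slot's class. Everything here is an assumption made for illustration: the toy LEXICON, the helper names generate and fits, and the uniform sampling. In a real model the word distributions per class would be learned, and the semantic topic would also influence which words are chosen.

```python
import random

# Hypothetical toy lexicon: each syntactic class has its own word list.
LEXICON = {
    "noun": ["I", "books", "students", "papers"],
    "verb": ["read", "write", "like"],
    "adjective": ["thick", "short", "old"],
}

def generate(skeleton, rng):
    """Draw one word per slot, uniformly from the word list of that slot's class."""
    return [rng.choice(LEXICON[label]) for label in skeleton]

def fits(words, skeleton):
    """Check that each word belongs to the class named in its slot."""
    return len(words) == len(skeleton) and all(
        w in LEXICON[label] for w, label in zip(words, skeleton)
    )

skeleton = ["noun", "verb", "adjective", "noun"]
sample = generate(skeleton, random.Random(0))
print(" ".join(sample))

# The example sentence fits the skeleton; a shuffled version does not.
print(fits("I read thick books".split(), skeleton))
print(fits("books thick read I".split(), skeleton))
```

Words within a class can be swapped freely, but words cannot move across slots, which is the sense in which exchangeability is only partial.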

Ideally we want to understand text in such a way that the generative process produces text that appears to have been written by a human.

I would like to thank Jon for pointing out the Integrating Topics and Syntax paper which talks about this. Only two days after I posted this entry he showed me this paper (of course Blei is one of the authors).

1 comment:

  1. Anonymous, 10:03 PM

    In Turing Test Two, two players A and B are again questioned by a human interrogator C. Before A gives his answer (labeled aa) to a question, he is also required to guess how the other player B will answer the same question; this guess is labeled ab. Similarly, B gives her answer (labeled bb) and her guess of A's answer, ba. The answers aa and ba are grouped together as group a, and similarly bb and ab are grouped together as group b. The interrogator is first given the answers as two separate groups, with only the group labels (a and b) and without the individual labels (aa, ab, ba and bb). If C cannot tell correctly which of aa and ba is from player A and which is from player B, B gets a score of one. If C cannot tell which of bb and ab is from player B and which is from player A, A gets a score of one. All answers (with the individual labels) are then made available to all parties (A, B and C), and the game continues. At the end of the game, the player who scored more is considered to have won the game and to be the more "intelligent".