Researchers in statistical language modeling employ the concept of a stoplist. A stoplist is a list of commonly occurring words such as "the", "of", and "are." When using a statistical technique based on the bag-of-words assumption (word exchangeability), these stopwords are discarded in the hope that the remaining words are the truly informative ones. Although suitable for classification and clustering tasks, such an approach falls short of modeling the syntax of the English language.
I believe that we should stop using stoplists. These 'meaningless' words are the glue that binds together the informative words, and if we want to perform tasks such as grammar checking or spell checking then we have to look beyond bag-of-words. By complementing models such as LDA with a latent syntactic label per word, we can attain partial exchangeability.
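To see what is lost, here is a minimal sketch (the stoplist and tokenizer are toy assumptions, not any standard list) of the usual bag-of-words preprocessing step:

```python
# Toy stoplist; real systems use curated lists of a few hundred words.
STOPLIST = {"the", "of", "are", "i", "a", "and", "to"}

def bag_of_words(sentence):
    """Tokenize naively and drop stopwords, as bag-of-words models do."""
    return [w for w in sentence.lower().split() if w not in STOPLIST]

print(bag_of_words("I read thick books"))
```

The pronoun "I" is discarded along with the rest of the glue, and since the surviving words are treated as exchangeable, the syntactic structure of the sentence is gone entirely.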
Latent Semantic Topic = A topic that denotes high-level information about the target sentence.
Latent Syntactic Topic = A topic that denotes the type of word (such as noun, verb, adjective).
Consider the sentence:
I read thick books.
This sentence is generated from the syntactic skeleton [noun, verb, adjective, noun].
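A toy sketch of this idea: the syntactic topic fixes the part of speech at each position, while a semantic topic (here a made-up "reading" topic supplying the word pools) would choose the content words. The pools and tags are illustrative assumptions, not a trained model.

```python
import random

# Hypothetical per-tag word pools; in a real model these distributions
# would be learned, and content words would come from a semantic topic.
POOLS = {
    "noun": ["I", "books", "papers"],
    "verb": ["read", "skim"],
    "adjective": ["thick", "dense"],
}

def generate(skeleton, rng=random):
    """Fill each slot of a syntactic skeleton with a word of that type."""
    return " ".join(rng.choice(POOLS[tag]) for tag in skeleton)

print(generate(["noun", "verb", "adjective", "noun"]))
```

Every sample respects the skeleton, so even random draws like "papers skim dense books" are grammatical in shape, which is exactly what pure bag-of-words models cannot guarantee.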
Ideally, we want to understand text in such a way that the generative process produces text that appears to have been written by a human.
I would like to thank Jon for pointing out the "Integrating Topics and Syntax" paper, which talks about exactly this. Only two days after I posted this entry he showed me the paper (of course, Blei is one of the authors).