The goal is to develop language models that improve recognition accuracy for conversational speech. We want to explore the use of phrase structure (possibly including syntactic and lexical information such as morphology and part-of-speech tags) to improve on the infamous trigram language model. Specifically, we would like to explore parsing-based models for predicting the next word.
We expect to use the various available treebanks (Wall Street Journal, Brown Corpus) for written text, but we also need a treebank for conversational speech. Specifically, we want one million words of Switchboard annotated for disfluencies and for surface structure, in a style similar to the WSJ Treebank.
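
For reference, the trigram baseline mentioned above predicts the next word from only the two preceding words. The following is a minimal sketch of an unsmoothed maximum-likelihood trigram estimate, not the system proposed here; the function names and the three-sentence toy corpus are hypothetical, chosen only for illustration, and a practical recognizer would add smoothing and train on far more data.

```python
from collections import Counter, defaultdict


def train_trigram(sentences):
    """Count next-word occurrences for every two-word history."""
    counts = defaultdict(Counter)
    for words in sentences:
        # Pad with start/end markers so sentence-initial words have a history.
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            counts[(w1, w2)][w3] += 1
    return counts


def next_word_prob(counts, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1, w2); 0.0 for an unseen history."""
    history = counts.get((w1, w2))
    if not history:
        return 0.0
    return history[w3] / sum(history.values())


# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    "i want to go home".split(),
    "i want to sleep".split(),
    "you want to go out".split(),
]
model = train_trigram(toy_corpus)
p_go = next_word_prob(model, "want", "to", "go")
```

Because the history is limited to two words, such a model cannot use any structure beyond that window, which is the limitation the parsing-based models are meant to address.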