A Scalable Distributed Syntactic, Semantic and Lexical Language Model – Shaojun Wang (Wright State University)

When: February 8, 2011 (all day)

Abstract
In this talk, I'll present an attempt at building a large-scale distributed composite language model, formed by seamlessly integrating an n-gram model, a structured language model, and probabilistic latent semantic analysis under a directed Markov random field paradigm, to simultaneously account for word-level lexical information, mid-range sentence-level syntactic structure, and long-span document-level semantic content. The composite language model is trained with a convergent N-best list approximate EM algorithm and a follow-up EM algorithm that improves word-prediction power, on corpora of up to a billion tokens, and the resulting model is stored on a supercomputer. The large-scale distributed composite language model yields a drastic perplexity reduction over n-gram models, and when applied to re-ranking the N-best list from a state-of-the-art parsing-based machine translation system, it achieves significantly better translation quality as measured by BLEU score and by the "readability" of the translations.
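To make the re-ranking step concrete, below is a minimal sketch (not the speaker's implementation) of rescoring an MT N-best list with a composite score. The Hypothesis fields, the component log-probabilities, and the interpolation weights are all illustrative assumptions; in practice the component scores would come from the trained n-gram, syntactic, and semantic models, and the weights would be tuned on held-out data (e.g., to maximize BLEU).

    # Hypothetical N-best re-ranking with a composite language model score.
    # All names and weights here are illustrative assumptions, not the
    # speaker's actual system.

    from dataclasses import dataclass

    @dataclass
    class Hypothesis:
        text: str
        decoder_score: float  # log-score from the MT decoder
        lm_ngram: float       # log P under the n-gram component (assumed given)
        lm_syntax: float      # log P under the structured-LM component (assumed given)
        lm_semantic: float    # log P under the PLSA component (assumed given)

    def composite_score(h: Hypothesis,
                        weights=(1.0, 0.5, 0.3, 0.2)) -> float:
        """Linearly combine the decoder score with the three LM components.

        The weights are placeholders; they would normally be tuned on
        held-out data rather than fixed by hand."""
        w_tm, w_ng, w_sy, w_se = weights
        return (w_tm * h.decoder_score
                + w_ng * h.lm_ngram
                + w_sy * h.lm_syntax
                + w_se * h.lm_semantic)

    def rerank(nbest: list) -> list:
        """Return the N-best list sorted by descending composite score."""
        return sorted(nbest, key=composite_score, reverse=True)

    # Toy usage: two candidate translations for one source sentence.
    nbest = [
        Hypothesis("the cat sat on the mat", -4.1, -10.2, -8.7, -6.3),
        Hypothesis("the cat sat in the mat", -3.9, -11.5, -9.9, -7.0),
    ]
    print(rerank(nbest)[0].text)

The design point the sketch illustrates is that the composite model's components contribute complementary evidence (lexical, syntactic, semantic), so a hypothesis the decoder slightly prefers can still lose to one the language-model components find more fluent.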
Biography
Shaojun Wang received his B.S. and M.S. in Electrical Engineering from Tsinghua University in 1988 and 1992, respectively, and an M.S. in Mathematics and a Ph.D. in Electrical Engineering from the University of Illinois at Urbana-Champaign in 1998 and 2001, respectively. From 2001 to 2005, he was a postdoctoral fellow at CMU, Waterloo, and the University of Alberta. He joined the Department of Computer Science and Engineering at Wright State University as an assistant professor in 2006. His research interests are statistical machine learning, natural language processing, and cloud computing. He is now mainly focused on two projects, large-scale distributed language modeling and semi-supervised discriminative structured prediction, funded by NSF, Google, and AFOSR. Both emphasize scalability and parallel/distributed approaches to processing extremely large datasets.

Center for Language and Speech Processing