Improving Statistical Parsers Using Cross-Corpus Data – Xiaoqiang Luo (IBM T.J. Watson Research Center)

October 15, 2002 all-day

The performance of a statistical parser often improves when it is trained with more labelled data, but acquiring labelled data is often expensive and labor-intensive. We address this problem by proposing to use data annotated for other purposes. Label information from another domain or corpus provides partial constraints for parsing, so the EM algorithm can be employed naturally to infer the missing information. I will present our results on improving a maximum entropy parser using cross-domain or cross-corpus data.

Xiaoqiang Luo received his bachelor's degree from the University of Science and Technology of China in 1990 and his Ph.D. from Johns Hopkins University in 1999, both in electrical engineering. Since 1998, he has been working at the IBM T.J. Watson Research Center as a senior software engineer. He was responsible for developing the semantic parser and interpreter used in the IBM DARPA Communicator. His research interests include statistical modeling in natural language processing (NLP), language modeling, speech recognition, and spoken dialog systems.

Center for Language and Speech Processing