Multilingual Guidance for Unsupervised Linguistic Structure Prediction
Dipanjan Das, Carnegie Mellon University
September 27, 2011
Learning linguistic analyzers from unannotated data remains a major challenge; can multilingual text help? In this talk, I will describe learning methods that use unannotated data in a target language along with annotated data in more resource-rich "helper" languages. I will focus on two lines of work. First, I will describe a graph-based semi-supervised learning approach that uses parallel data to learn part-of-speech tag sequences through type-level lexical transfer from a helper language. Second, I will examine a more ambitious goal of learning part-of-speech sequences and dependency trees from raw text, leveraging parameter-level transfer from helper languages, but without any parallel data. Both approaches result in significant improvements over strong state-of-the-art monolingual unsupervised baselines.
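The graph-based semi-supervised approach in the first line of work can be sketched roughly as label propagation over a graph of word types, where a few types receive tag distributions projected from the helper language and the rest inherit them from their neighbors. The sketch below is a minimal, generic label-propagation routine on toy data; the words, edges, seed distributions, mixing weight, and function names are all illustrative assumptions, not the talk's actual model.

```python
# Hedged sketch of type-level label propagation for POS tag transfer.
# All data (words, similarity edges, seed tags) are toy assumptions.

def propagate(graph, seeds, tags, iters=30, alpha=0.5):
    """Iteratively mix each node's tag distribution with the average of
    its neighbors'; seed (labeled) nodes stay clamped to their seeds."""
    uniform = {t: 1.0 / len(tags) for t in tags}
    # Initialize: seeds get their projected distribution, others uniform.
    dist = {node: dict(seeds.get(node, uniform)) for node in graph}
    for _ in range(iters):
        new = {}
        for node, nbrs in graph.items():
            if node in seeds:                 # clamp labeled seeds
                new[node] = dict(seeds[node])
                continue
            # Average the neighbors' current distributions.
            agg = {t: sum(dist[n][t] for n in nbrs) / len(nbrs) for t in tags}
            # Interpolate with a uniform prior for smoothing.
            new[node] = {t: alpha * agg[t] + (1 - alpha) * uniform[t]
                         for t in tags}
        dist = new
    return dist

# Toy target-language graph; seed tags stand in for distributions
# projected across parallel data from a helper language (assumed).
tags = ["NOUN", "VERB"]
graph = {
    "perro": ["gato"], "gato": ["perro"],
    "corre": ["salta"], "salta": ["corre"],
}
seeds = {
    "perro": {"NOUN": 1.0, "VERB": 0.0},
    "corre": {"NOUN": 0.0, "VERB": 1.0},
}
result = propagate(graph, seeds, tags)
best = {w: max(d, key=d.get) for w, d in result.items()}
# Unlabeled types inherit the tag of their seeded neighbor:
# "gato" -> NOUN, "salta" -> VERB.
```

In the actual work, the graph edges come from distributional similarity in the target language and the seeds from word alignments in parallel text; the propagated distributions then constrain an unsupervised tagging model rather than being used directly.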
Dipanjan Das is a Ph.D. student at the Language Technologies Institute, School of Computer Science at Carnegie Mellon University. He works on statistical natural language processing under the mentorship of Noah Smith. He finished his M.S. at the same institute in 2008, conducting research on language generation with Alexander Rudnicky. Das completed his undergraduate degree in 2005 at the Indian Institute of Technology, Kharagpur, where he received the best undergraduate thesis award in Computer Science and Engineering and the Dr. B.C. Roy Memorial Gold Medal for best all-round performance in academics and co-curricular activities. He interned at Google Research, New York in 2010 and received the best paper award at the ACL 2011 conference. He has published and served as a program committee member and reviewer at conferences such as ACL, NIPS, NAACL, COLING, and EMNLP during 2008–2011.