Advances in Machine Translation and Language Understanding – Chris Callison-Burch (Johns Hopkins)
Abstract
Modern approaches to machine translation, like those used in Google’s online translation system, are data-driven. Statistical translation models are trained using bilingual parallel texts, which consist of sentences in one language paired with their translations into another language. Probabilistic word-for-word and phrase-for-phrase translation tables are extracted from the human-translated parallel texts and are then used as the basic building blocks of automatic translation systems. Although these data-driven methods have been successfully applied to a small handful of the world’s languages, can they be used to translate all of the world’s languages? I’ll describe cost- and model-focused innovations that make this plausible. I’ll also briefly outline how these methods can help with the longstanding artificial intelligence goal of language understanding.
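To make the "building blocks" concrete, here is a minimal sketch of how a word-for-word translation table can be estimated from word-aligned sentence pairs by relative frequency. The tiny corpus and its alignments are hypothetical, and real systems work at the phrase level with far more sophisticated alignment models.

```python
from collections import defaultdict

# Toy word-aligned sentence pairs (hypothetical data, for illustration only).
# Each entry: (source tokens, target tokens, alignment links (src_i, tgt_j)).
aligned_corpus = [
    (["la", "maison"], ["the", "house"], [(0, 0), (1, 1)]),
    (["la", "fleur"], ["the", "flower"], [(0, 0), (1, 1)]),
    (["maison", "bleue"], ["blue", "house"], [(0, 1), (1, 0)]),
]

# Count how often each source word is aligned to each target word.
counts = defaultdict(lambda: defaultdict(int))
for src, tgt, links in aligned_corpus:
    for i, j in links:
        counts[src[i]][tgt[j]] += 1

# Normalize the counts into relative-frequency estimates of p(target | source).
table = {}
for s, tgts in counts.items():
    total = sum(tgts.values())
    table[s] = {t: c / total for t, c in tgts.items()}

print(table["la"])      # {'the': 1.0}
print(table["maison"])  # {'house': 1.0}
```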
In this talk, I will present four research areas that my lab has been working on:
Improved translation models: I will demonstrate that syntactic translation models significantly outperform linguistically naive models for Urdu-English. Urdu is a low-resource language whose word order diverges significantly from that of English. Syntactic information allows better generalizations to be learned from the bilingual training data.
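To illustrate the kind of generalization at stake for a verb-final language like Urdu, here is a toy sketch of one synchronous rule that swaps an object and its verb. The rule, lexicon, and romanized Urdu glosses are all hypothetical, not drawn from an actual Urdu-English grammar.

```python
# A toy synchronous rule in the spirit of syntactic translation models:
#   VP -> <NP:X1 V:X2 , X2 X1>
# Because the nonterminals X1/X2 are linked across the two sides, a single
# rule reorders the verb around *any* object phrase -- the kind of
# generalization that flat, memorized phrase pairs miss.

def translate_vp(obj_words, verb_words, lexicon):
    """Apply VP -> <X1 X2, X2 X1>: translate each piece, then swap them."""
    obj = [lexicon[w] for w in obj_words]
    verb = [lexicon[w] for w in verb_words]
    return verb + obj  # English puts the verb before the object

# Romanized Urdu glosses (illustrative dictionary entries).
lexicon = {"kitab": "book", "khat": "letter", "parhta": "reads", "likhta": "writes"}

print(translate_vp(["kitab"], ["parhta"], lexicon))  # ['reads', 'book']
print(translate_vp(["khat"], ["likhta"], lexicon))   # ['writes', 'letter']
```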
Crowdsourcing: I have been using Amazon’s Mechanical Turk crowdsourcing platform to translate large volumes of text at low cost. I will show how we can achieve professional-level translation quality using non-professional translators, at a cost that is an order of magnitude lower than professional translation. This makes it feasible to collect enough data to train statistical models, which I demonstrate for Arabic dialect translation.
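One simple way to approach professional quality with non-professional workers is to solicit several redundant translations of each sentence and keep the one its peers agree with most. The heuristic below is a sketch of that idea, not the richer feature-based quality model described in the talk, and the candidate translations are hypothetical.

```python
from difflib import SequenceMatcher

def pick_consensus(candidates):
    """Return the candidate most similar, on average, to all the others."""
    def avg_sim(i):
        sims = [SequenceMatcher(None, candidates[i], candidates[j]).ratio()
                for j in range(len(candidates)) if j != i]
        return sum(sims) / len(sims)
    return candidates[max(range(len(candidates)), key=avg_sim)]

# Four hypothetical Turker translations of the same source sentence.
candidates = [
    "the weather is nice today",
    "today the weather is nice",
    "the weather is nice today",
    "weather good now",  # a low-effort outlier loses the consensus vote
]
print(pick_consensus(candidates))  # 'the weather is nice today'
```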
Translation without bilingual training data: In addition to using crowdsourcing to reduce costs, I am introducing new methods that remove the dependence on expensive bilingual data by redesigning translation models so that they can be trained using inexpensive monolingual data. I will show end-to-end translation performance for a system trained using only a small bilingual dictionary and two large monolingual texts.
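A sketch of one way a seed dictionary plus two monolingual corpora can yield new translations: words that appear in similar contexts should translate to words that appear in similar contexts, once the context is projected through the dictionary. The corpora and seed entries below are hypothetical toys; the models in the talk are more elaborate.

```python
import math
from collections import Counter

def context_vectors(corpus, window=2):
    """Bag-of-words context counts for every word in a monolingual corpus."""
    vecs = {}
    for sent in corpus:
        for i, w in enumerate(sent):
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def induce(src_word, src_corpus, tgt_corpus, seed):
    """Project the source word's context vector through the seed dictionary,
    then return the target word whose own context vector is most similar."""
    src_vecs = context_vectors(src_corpus)
    tgt_vecs = context_vectors(tgt_corpus)
    projected = Counter()
    for w, c in src_vecs[src_word].items():
        if w in seed:                 # unknown context words are dropped
            projected[seed[w]] += c
    return max(tgt_vecs, key=lambda t: cosine(projected, tgt_vecs[t]))

# Hypothetical toy corpora and a two-entry seed dictionary.
seed = {"rot": "red", "gruen": "green"}
src = [["der", "apfel", "ist", "rot"], ["das", "gras", "ist", "gruen"]]
tgt = [["the", "apple", "is", "red"], ["the", "grass", "is", "green"]]
print(induce("apfel", src, tgt, seed))  # 'apple'
```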
Natural language understanding: I will show how the data and methods from translation can be applied to the classic AI problem of understanding language. I will explain how to learn paraphrases and other meaning-preserving English transformations from bilingual data, and demonstrate how these can be used for a variety of monolingual text-to-text generation tasks like sentence compression, simplification, English as a Second Language (ESL) error correction, and poetry generation.
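The core idea behind learning paraphrases from bilingual data is pivoting: two English phrases that translate to the same foreign phrase are likely to mean the same thing, scored as p(e2 | e1) = Σ_f p(e2 | f) p(f | e1). Below is a minimal sketch of that computation; the phrase-table probabilities are hypothetical toy values.

```python
# Paraphrase extraction by pivoting through shared foreign phrases.
# Both halves of a toy phrase table (hypothetical probabilities).
p_f_given_e = {  # p(foreign | english)
    "thrown into jail": {"festgenommen": 0.6, "inhaftiert": 0.4},
    "imprisoned":       {"inhaftiert": 0.8, "festgenommen": 0.2},
    "arrested":         {"festgenommen": 1.0},
}
p_e_given_f = {  # p(english | foreign)
    "festgenommen": {"arrested": 0.7, "thrown into jail": 0.2, "imprisoned": 0.1},
    "inhaftiert":   {"imprisoned": 0.7, "thrown into jail": 0.3},
}

def paraphrase_probs(e1):
    """Score candidate paraphrases: p(e2|e1) = sum_f p(e2|f) * p(f|e1)."""
    probs = {}
    for f, p_f in p_f_given_e[e1].items():
        for e2, p_e in p_e_given_f[f].items():
            if e2 != e1:
                probs[e2] = probs.get(e2, 0.0) + p_f * p_e
    return sorted(probs.items(), key=lambda kv: -kv[1])

for e2, p in paraphrase_probs("thrown into jail"):
    print(f"{e2}: {p:.2f}")  # arrested: 0.42, imprisoned: 0.34
```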
Biography
Chris Callison-Burch is currently an Associate Research Professor in the Computer Science Department at Johns Hopkins University, where he has built a research group within the Center for Language and Speech Processing (CLSP). In the fall he will be starting a tenure-track position in the Computer and Information Sciences Department at the University of Pennsylvania. He received his PhD from the University of Edinburgh’s School of Informatics in 2008 and his bachelor’s from Stanford University’s Symbolic Systems Program in 2000. His research focuses on statistical machine translation, crowdsourcing, and broad-coverage semantics via paraphrasing. He has contributed to the research community by releasing open source software like Moses and Joshua, and by organizing the shared tasks for the annual Workshop on Statistical Machine Translation (WMT). He is the Chair of the North American chapter of the Association for Computational Linguistics (NAACL) and serves on the editorial boards of Computational Linguistics and the Transactions of the ACL (TACL).