Confusion-based Statistical Language Modeling for Machine Translation and Speech Recognition

Research Group of the 2011 Summer Workshop

How can we decide that one sentence is more likely in a language than another, especially if neither sentence has ever been seen before in its entirety? Why would we want to? The answer to the second question is that many natural language applications, such as machine translation and automatic speech recognition, produce a multitude of possible sentences as output (of translation or recognition), and the likelihood of each resulting sentence in the language is a key way to choose between them. New methods for answering the first question are the topic of this summer workshop project. For the same “true” output, the set of competing outputs (‘confusions’) depends on the application: in speech recognition, the confusions typically sound similar (such as ‘their’ and ‘there’), while in machine translation, the confusions depend on ambiguities that arise in the translation process for a particular language pair (different for, say, Chinese and German when translating into English). In this project, we will investigate techniques to automatically generate possible confusions for a particular task and to learn statistical models of language from such confusions. These models can then be used to do a better job of choosing which of the alternative outputs of a particular system is best. This project is a chance to work on cutting-edge speech and natural language applications and to get your hands dirty under the hood of state-of-the-art systems while trying to make them better.
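
As a rough illustration of learning a language model from confusions and using it to choose among alternative outputs, here is a minimal Python sketch. It is not the workshop's method: it assumes a simple perceptron reranker over unigram and bigram counts, and the function names (`ngram_features`, `score`, `rerank`, `train_perceptron`) and the toy confusion sets are hypothetical, written by hand rather than generated automatically.

```python
from collections import Counter


def ngram_features(sentence, max_n=2):
    """Count the unigrams and bigrams of a sentence, padded with boundary markers."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


def score(weights, sentence):
    """Linear model score: dot product of n-gram counts and learned weights."""
    return sum(weights.get(f, 0.0) * c for f, c in ngram_features(sentence).items())


def rerank(weights, candidates):
    """Return the highest-scoring candidate (ties go to the earliest one)."""
    return max(candidates, key=lambda s: score(weights, s))


def train_perceptron(confusion_sets, epochs=5):
    """Train on (reference, confusions) pairs with the perceptron update rule."""
    weights = {}
    for _ in range(epochs):
        for reference, confusions in confusion_sets:
            # Listing the confusions first means the model must actively
            # prefer the reference; a tie counts as a mistake.
            guess = rerank(weights, confusions + [reference])
            if guess != reference:
                # Reward n-grams of the reference, penalize those of the wrong guess.
                for f, c in ngram_features(reference).items():
                    weights[f] = weights.get(f, 0.0) + c
                for f, c in ngram_features(guess).items():
                    weights[f] = weights.get(f, 0.0) - c
    return weights


# Toy, hand-written confusion sets (hypothetical data, for illustration only).
toy_data = [
    ("they left their keys at home", ["they left there keys at home"]),
    ("put the box over there", ["put the box over their"]),
]
weights = train_perceptron(toy_data)
best = rerank(weights, ["i saw there keys", "i saw their keys"])
```

In this sketch the model never sees the test sentences in training; it picks between them using the bigram weights it learned from the confusion pairs, which is the sense in which unseen sentences can still be compared.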

Team Members
Senior Members
Sanjeev Khudanpur	CLSP
Chris Callison-Burch	CLSP
Dan Bikel	Google
Keith Hall	Google
Philipp Koehn	University of Edinburgh
Brian Roark	Oregon Health and Science University
Kenji Sagae	University of Southern California
Graduate Students
Puyang Xu	CLSP
Charley Chan	CLSP
Yuan Cao	CLSP
Eva Hasler	University of Edinburgh
Maider Lehr	Oregon Health and Science University
Emily Tucker	Oregon Health and Science University
Undergraduate Students
Nathan Glenn	Brigham Young University
Darcey Riley	University of Rochester
Affiliate Members
Damianos Karakos	CLSP
Adam Lopez	CLSP
Zhifei Li	Google
Matt Post	Johns Hopkins University
Murat Saraclar	Boğaziçi University
Izhak Shafran	Oregon Health and Science University

Center for Language and Speech Processing