Open Source Toolkit for Statistical Machine Translation

Research Group of the 2006 Summer Workshop

The objective of this JHU Workshop is to develop novel methods for statistical machine translation that improve the state of the art, specifically factored translation models and lattice-based decoding methods. As part of this workshop, we will implement these techniques and distribute them in an open source toolkit.

We propose to extend phrase-based statistical machine translation models using a factored representation. Current statistical MT approaches represent each word simply as its textual form. A factored translation approach replaces this representation with a feature vector for each word, derived from a variety of information sources. These features may include the surface form, lemma, stem, part-of-speech tag, morphological information, and syntactic, semantic, or automatically derived categories. This representation is then used to construct statistical translation models that can be combined to maximize translation quality.
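As a rough illustration of the factored representation described above, the following Python sketch stores each word as a small feature vector rather than a bare string. The class name `FactoredWord`, the choice of three factors (surface form, lemma, part-of-speech tag), the `|` separator, and the example sentence are all assumptions made for illustration; they are not taken from the workshop's toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactoredWord:
    """One word represented by several factors (hypothetical layout)."""
    surface: str  # the textual form as it appears in the sentence
    lemma: str    # the dictionary form
    pos: str      # the part-of-speech tag


def parse_factored(token: str, sep: str = "|") -> FactoredWord:
    """Split a 'surface|lemma|pos' token into its three factors."""
    surface, lemma, pos = token.split(sep)
    return FactoredWord(surface, lemma, pos)


# Illustrative input: every token carries its factors, joined by the separator.
sentence = "houses|house|NNS are|be|VBP big|big|JJ"
words = [parse_factored(token) for token in sentence.split()]

# A model can now condition on any single factor, for example the lemmas only.
lemmas = [word.lemma for word in words]
```

Keeping the factors as separate fields means one translation model could, for instance, map lemmas while another maps part-of-speech tags, with the models combined afterwards.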

We also propose to extend current MT decoding methods to process multiple, ambiguous hypotheses in the form of an input lattice. A lattice representation allows an MT system to arbitrate among multiple ambiguous hypotheses from upstream processing so that the best translation can be produced. During the workshop we will implement lattice decoding and run experiments with errorful automatic speech recognition (ASR) input. We will compare different lattice-based strategies against results obtained with single-hypothesis input.
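To make the idea of an input lattice concrete, here is a minimal Python sketch of a lattice as a list of scored edges, together with a dynamic-programming search for the highest-scoring path. The edge layout, the function name `best_path`, and the example words and probabilities are hypothetical. The sketch only selects the single best input hypothesis; a lattice decoder as proposed above would instead keep the alternatives available during translation.

```python
import math

# Each edge is (start_node, end_node, word, log_probability).
# Assumption: nodes are numbered so that every edge goes from a lower
# to a higher node number, i.e. the numbering is a topological order.
edges = [
    (0, 1, "recognize", math.log(0.6)),
    (0, 1, "wreck a nice", math.log(0.4)),
    (1, 2, "speech", math.log(0.7)),
    (1, 2, "beach", math.log(0.3)),
]


def best_path(edges, start, end):
    """Return (log score, word list) of the best path from start to end."""
    best = {start: (0.0, [])}
    # Sorting by start node guarantees a node's best score is final
    # before any of its outgoing edges is processed.
    for u, v, word, logp in sorted(edges):
        if u not in best:
            continue  # node u is not reachable from the start node
        score = best[u][0] + logp
        if v not in best or score > best[v][0]:
            best[v] = (score, best[u][1] + [word])
    return best[end]


score, path = best_path(edges, start=0, end=2)
```

Reducing the lattice to `path` alone corresponds to the single-hypothesis input baseline; passing the whole `edges` structure to the decoder is what lets it recover when the top recognition hypothesis is wrong.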

Final Report
Find details about the plans and progress of this project here.

Team Members
Senior Members
Chris Callison-Burch	CLSP
Nicola Bertoldi	ITC-IRST
Marcello Federico	ITC-IRST
Philipp Koehn	University of Edinburgh
Wade Shen	Lincoln Labs
Graduate Students
Ondrej Bojar	Charles University
Brooke Cowan	MIT
Chris Dyer	University of Maryland
Hieu Hoang	University of Edinburgh
Richard Zens	Aachen University
Undergraduate Students
Alexandra Constantin	Williams College
Evan Herbst	Cornell
Christine Corbett Moran	MIT

Center for Language and Speech Processing