Syntax for Statistical Machine Translation

In recent evaluations of machine translation systems, statistical systems based on probabilistic models have outperformed classical approaches based on interpretation, transfer, and generation. Nonetheless, the output of statistical systems often contains obvious grammatical errors. This can be attributed to the fact that syntactic well-formedness is influenced only by local n-gram language models and simple alignment models. We aim to integrate syntactic structure into statistical models to address this problem. A very convenient and promising approach for this integration is the maximum entropy framework, which allows us to integrate many different knowledge sources into an overall model and to train the combination weights discriminatively. This approach will allow us to extend a baseline system easily by adding new feature functions.
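To make the combination concrete, here is a minimal sketch of how a log-linear (maximum entropy) model scores candidate translations as a weighted sum of feature functions and selects the highest-scoring hypothesis. The feature names, values, and weights below are invented purely for illustration; in an actual system the weights would be trained discriminatively on held-out data.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values; exponentiating and normalizing this
    quantity would give the maximum-entropy probability of the candidate."""
    return sum(weights[name] * value for name, value in features.items())

def rerank(candidates, weights):
    """Return the candidate translation with the highest log-linear score.
    `candidates` maps each hypothesis string to its feature-value dict."""
    return max(candidates, key=lambda hyp: loglinear_score(candidates[hyp], weights))

# Hypothetical n-best list for one source sentence: each candidate carries
# the baseline model scores plus one new syntactic feature (values invented).
candidates = {
    "the committee approved the law": {"lm": -12.4, "tm": -8.1, "parse_ok": 1.0},
    "the committee the law approved": {"lm": -11.9, "tm": -7.6, "parse_ok": 0.0},
}
weights = {"lm": 1.0, "tm": 0.9, "parse_ok": 2.5}  # trained discriminatively in practice
print(rerank(candidates, weights))  # -> "the committee approved the law"
```

New knowledge sources enter the model simply as additional entries in the feature dictionaries with their own weights, which is what makes this framework convenient for adding syntactic features to a baseline system.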

The workshop will start with a strong baseline: the alignment template statistical machine translation system that obtained the best results in the 2002 DARPA MT evaluations. During the workshop, we will incrementally add new features representing syntactic knowledge that address specific problems of the underlying baseline. We want to investigate a broad range of possible feature functions, from very simple binary features to sophisticated tree-to-tree translation models. Simple feature functions might test whether a certain constituent occurs in both the source and the target language parse trees. More sophisticated features will be derived from an alignment model in which whole sub-trees in the source and target can be aligned node by node. We also plan to investigate features based on projecting parse trees from one language onto strings of the other, a useful technique when parses are available for only one of the two languages. We will extend previous tree-based alignment models by allowing partial tree alignments when the two syntactic structures are not isomorphic.
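As one example of the very simple end of this range, the sketch below implements a binary feature that fires when a given constituent label appears in both the source and the target parse tree. It assumes parses are available in bracketed (Treebank-style) notation and uses NLTK's Tree class for convenience; the toy sentence pair, tags, and function names are invented for illustration.

```python
from nltk import Tree  # bracketed (Treebank-style) parse trees

def has_constituent(tree, label):
    """True if any node of the parse tree carries the given constituent label."""
    return any(t.label() == label for t in tree.subtrees())

def constituent_match_feature(src_tree, tgt_tree, label):
    """Binary feature: 1.0 if the constituent occurs in both the source and
    the target parse tree, else 0.0."""
    return 1.0 if has_constituent(src_tree, label) and has_constituent(tgt_tree, label) else 0.0

# Toy Chinese-English sentence pair with hand-written parses (invented example).
src = Tree.fromstring("(IP (NP (PN ta)) (VP (VV xie) (AS le) (NP (NN baogao))))")
tgt = Tree.fromstring("(S (NP (PRP he)) (VP (VBD wrote) (NP (DT the) (NN report))))")
print(constituent_match_feature(src, tgt, "VP"))  # -> 1.0
```

A feature of this kind would simply be added to the log-linear model above with its own weight, alongside the baseline language and translation model scores.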

We will work with the Chinese-English data from the recent evaluations, since large amounts of sentence-aligned training data, as well as multiple reference translations, are available. This will also allow us to compare our results with those of the various systems participating in the evaluations. In addition, annotation is underway on a Chinese-English parallel treebank. We plan to evaluate the improvement of our system using both automatic metrics that compare system output against reference translations (BLEU and NIST) and subjective evaluations of adequacy and fluency. We hope both to improve machine translation performance and to advance our understanding of how linguistic representations can be integrated into statistical models of language.
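For readers unfamiliar with the automatic metrics, the following sketch computes a simplified sentence-level BLEU-style score from clipped n-gram precisions and a brevity penalty against multiple references. The example hypothesis and references are invented; the official BLEU and NIST scores are computed at the corpus level, with further details (and, in the case of NIST, information-weighted n-grams) omitted here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, refs, n):
    """Clipped n-gram precision against multiple references (BLEU-style)."""
    hyp_counts = Counter(ngrams(hyp, n))
    max_ref = Counter()
    for ref in refs:
        for ng, c in Counter(ngrams(ref, n)).items():
            max_ref[ng] = max(max_ref[ng], c)
    clipped = sum(min(c, max_ref[ng]) for ng, c in hyp_counts.items())
    return clipped / max(sum(hyp_counts.values()), 1)

def simple_bleu(hyp, refs, max_n=4):
    """Geometric mean of 1..4-gram clipped precisions times a brevity penalty.
    Real evaluations pool statistics over the whole corpus and apply smoothing,
    so treat this only as an illustration."""
    precisions = [modified_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    closest_ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > closest_ref_len else math.exp(1 - closest_ref_len / max(len(hyp), 1))
    return bp * math.exp(log_avg)

# Invented hypothesis and two reference translations.
hyp = "the committee approved the new law".split()
refs = [r.split() for r in ("the committee approved the new law",
                            "the commission passed the new law")]
print(round(simple_bleu(hyp, refs), 3))  # -> 1.0 (exact match with one reference)
```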


Team Members
Senior Members
Sanjeev Khudanpur, CLSP
Daniel Gildea, University of Pennsylvania
Franz Och, USC/ISI
Anoop Sarkar, Simon Fraser University
Kenji Yamada, Xerox
Graduate Students
Alexander Fraser, USC/ISI
Shankar Kumar, JHU
Libin Shen, University of Pennsylvania
David Smith, JHU
Undergraduate Students
Katherine Eng, Stanford
Viren Jain, University of Pennsylvania
Jin Zhen, Mt. Holyoke

Center for Language and Speech Processing