Syntax for Statistical Machine Translation

In recent evaluations of machine translation systems, statistical systems based on probabilistic models have outperformed classical approaches based on interpretation, transfer, and generation. Nonetheless, the output of statistical systems often contains obvious grammatical errors. This can be attributed to the fact that syntactic well-formedness is influenced only by local n-gram language models and simple alignment models. We aim to integrate syntactic structure into statistical models to address this problem. A very convenient and promising approach for this integration is the maximum entropy framework, which allows us to integrate many different knowledge sources into an overall model and to train the combination weights discriminatively. This approach will allow us to extend a baseline system easily by adding new feature functions.
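To make the combination concrete, here is a minimal sketch of how a log-linear (maximum entropy) model scores candidate translations as a weighted sum of feature functions and selects the highest-scoring hypothesis. The feature names, values, and weights below are invented purely for illustration; in an actual system the weights would be trained discriminatively on held-out data.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values; exponentiating and normalizing this
    quantity would give the maximum-entropy probability of the candidate."""
    return sum(weights[name] * value for name, value in features.items())

def rerank(candidates, weights):
    """Return the candidate translation with the highest log-linear score.
    `candidates` maps each hypothesis string to its feature-value dict."""
    return max(candidates, key=lambda hyp: loglinear_score(candidates[hyp], weights))

# Hypothetical n-best list for one source sentence: each candidate carries
# the baseline model scores plus one new syntactic feature (values invented).
candidates = {
    "the committee approved the law": {"lm": -12.4, "tm": -8.1, "parse_ok": 1.0},
    "the committee the law approved": {"lm": -11.9, "tm": -7.6, "parse_ok": 0.0},
}
weights = {"lm": 1.0, "tm": 0.9, "parse_ok": 2.5}  # trained discriminatively in practice
print(rerank(candidates, weights))  # -> "the committee approved the law"
```

New knowledge sources enter the model simply as additional entries in the feature dictionaries with their own weights, which is what makes this framework convenient for adding syntactic features to a baseline system.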

The workshop will start with a strong baseline: the alignment template statistical machine translation system that obtained the best results in the 2002 DARPA MT evaluations. During the workshop, we will incrementally add new features representing syntactic knowledge that address specific problems of the underlying baseline. We want to investigate a broad range of possible feature functions, from very simple binary features to sophisticated tree-to-tree translation models. Simple feature functions might test whether a certain constituent occurs in both the source and the target language parse trees. More sophisticated features will be derived from an alignment model in which whole sub-trees in the source and target can be aligned node by node. We also plan to investigate features based on projecting parse trees from one language onto strings of the other, a useful technique when parses are available for only one of the two languages. We will extend previous tree-based alignment models by allowing partial tree alignments when the two syntactic structures are not isomorphic.
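As one example of the very simple end of this range, the sketch below implements a binary feature that fires when a given constituent label appears in both the source and the target parse tree. It assumes parses are available in bracketed (Treebank-style) notation and uses NLTK's Tree class for convenience; the toy sentence pair, tags, and function names are invented for illustration.

```python
from nltk import Tree  # bracketed (Treebank-style) parse trees

def has_constituent(tree, label):
    """True if any node of the parse tree carries the given constituent label."""
    return any(t.label() == label for t in tree.subtrees())

def constituent_match_feature(src_tree, tgt_tree, label):
    """Binary feature: 1.0 if the constituent occurs in both the source and
    the target parse tree, else 0.0."""
    return 1.0 if has_constituent(src_tree, label) and has_constituent(tgt_tree, label) else 0.0

# Toy Chinese-English sentence pair with hand-written parses (invented example).
src = Tree.fromstring("(IP (NP (PN ta)) (VP (VV xie) (AS le) (NP (NN baogao))))")
tgt = Tree.fromstring("(S (NP (PRP he)) (VP (VBD wrote) (NP (DT the) (NN report))))")
print(constituent_match_feature(src, tgt, "VP"))  # -> 1.0
```

A feature of this kind would simply be added to the log-linear model above with its own weight, alongside the baseline language and translation model scores.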

We will work with the Chinese-English data from the recent evaluations, since large amounts of sentence-aligned training data, as well as multiple reference translations, are available. This will also allow us to compare our results with those of the various systems participating in the evaluations. In addition, annotation is underway on a Chinese-English parallel treebank. We plan to evaluate the improvement of our system using both automatic metrics that compare system output against reference translations (BLEU and NIST) and subjective evaluations of adequacy and fluency. We hope both to improve machine translation performance and to advance our understanding of how linguistic representations can be integrated into statistical models of language.
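For readers unfamiliar with the automatic metrics, the following sketch computes a simplified sentence-level BLEU-style score from clipped n-gram precisions and a brevity penalty against multiple references. The example hypothesis and references are invented; the official BLEU and NIST scores are computed at the corpus level, with further details (and, in the case of NIST, information-weighted n-grams) omitted here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, refs, n):
    """Clipped n-gram precision against multiple references (BLEU-style)."""
    hyp_counts = Counter(ngrams(hyp, n))
    max_ref = Counter()
    for ref in refs:
        for ng, c in Counter(ngrams(ref, n)).items():
            max_ref[ng] = max(max_ref[ng], c)
    clipped = sum(min(c, max_ref[ng]) for ng, c in hyp_counts.items())
    return clipped / max(sum(hyp_counts.values()), 1)

def simple_bleu(hyp, refs, max_n=4):
    """Geometric mean of 1..4-gram clipped precisions times a brevity penalty.
    Real evaluations pool statistics over the whole corpus and apply smoothing,
    so treat this only as an illustration."""
    precisions = [modified_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    closest_ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > closest_ref_len else math.exp(1 - closest_ref_len / max(len(hyp), 1))
    return bp * math.exp(log_avg)

# Invented hypothesis and two reference translations.
hyp = "the committee approved the new law".split()
refs = [r.split() for r in ("the committee approved the new law",
                            "the commission passed the new law")]
print(round(simple_bleu(hyp, refs), 3))  # -> 1.0 (exact match with one reference)
```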


Team Members
Senior Members
Sanjeev Khudanpur, CLSP
Daniel Gildea, University of Pennsylvania
Franz Och, USC/ISI
Anoop Sarkar, Simon Fraser University
Kenji Yamada, Xerox
Graduate Students
Alexander Fraser, USC/ISI
Shankar Kumar, JHU
Libin Shen, University of Pennsylvania
David Smith, JHU
Undergraduate Students
Katherine Eng, Stanford
Viren Jain, University of Pennsylvania
Jin Zhen, Mt. Holyoke

Center for Language and Speech Processing