Domain Adaptation in Statistical Machine Translation

Introduction: Statistical machine translation (SMT) systems perform poorly when applied to new domains. This degradation in quality can be as much as …“ of the original system’s performance; the Figure below provides a small qualitative example, and illustrates that unknown words (copied verbatim) and incorrect translations are major sources of error. When parallel data is plentiful in a new domain, the primary challenge becomes that of scoring good translations higher than bad translations. This is often accomplished either with mixture models that down-weight the contribution of old-domain corpora, or with subsampling techniques that attempt to force the translation model to pay more attention to sentences resembling the new domain. A more sophisticated approach recently demonstrated that phrase-level adaptation can perform better. However, these approaches are still less sophisticated than state-of-the-art domain adaptation (DA) techniques from the machine learning community. Such techniques have not been applied to SMT, likely due to the mismatch between SMT models and the classification setting that dominates the DA literature. The Phrase Sense Disambiguation (PSD) approach to translation, which treats SMT lexical choice as a classification task, allows us to bridge this gap. In particular, classification-based DA techniques can be applied to PSD to improve translation scoring. Unfortunately, this is not enough when only comparable data exists in the new domain. Here, we face the additional challenge of identifying unseen words, as well as unknown senses of seen words, and of finding potential translations for these lexical entries. Once we have identified potential translations, we still need to score them, and the techniques we develop for the parallel-data case apply directly.


Old Domain
Original German text: wenn das geschieht, würden die serben aus dem nordkosovo wahrscheinlich ihre eigene unabhängigkeit erklären.
Human translation: if that happens, the serbs from north kosovo would probably have their own independence.
SMT output: if that happens, it is likely that the serbs of north kosovo would declare their own independence.

New Domain (Medical)
Original German text: darreichungsform : weißes pulver und klares , farbloses lösungsmittel zur herstellung einer injektionslösung
Human translation: pharmaceutical form : white powder and clear , colourless solvent for solution for injection
SMT output: darreichungsform : white powder and clear , pale solvents to establish a injektionslösung

Figure: Output of an SMT system. The first example is from the system’s old training domain; the second is from an unseen new domain. Incorrect translations (highlighted in red in the original figure) fall into two groups: the two German words are unknown to the system and copied verbatim, while the two English words are word-sense errors.

1. Understand domain divergence in parallel data and how it affects SMT models, through analysis of carefully defined test beds that will be released to the community.
2. Design new SMT lexical choice models to improve translation quality across domains in two settings:
a. When new domain parallel data is available, we will leverage existing machine learning algorithms to adapt PSD models, and explore a rich space of context features, including document level context and morphological features.
b. When we only have comparable data in the new domain, we will learn training examples for PSD by identifying new translations for new senses.

Approach: While BLEU scores suggest that SMT lexical choice is often incorrect outside of the training domain, we do not yet fully understand the sources of translation error for different domains, languages and data conditions. In a DA setting without new parallel data, we have identified unseen words and senses as the main sources of error in many new domains, by analyzing impacts on BLEU. We will conduct similar analyses for the setting with new parallel data. We will also consider sources of error like word alignment or decoding. We will exploit parallel text to better understand differences between general and domain-specific phrase usage, and their impact on SMT.

We can learn differences between general language terms, domain-specific terms, and domain-specific usages of general terms, by using their translations as a sense annotation. This is a complex task, since domain shifts are not the only cause of translation ambiguity. For instance, in English-to-French translation, “run” is usually translated in the computer domain as “exécuter”, and in the sports domain as “courir”; but other senses (such as “diriger”, “to manage”) can appear in many domains. Sense distinctions also depend on language pairs, which suggests that comparable data in the input language truly is necessary. For example, consider the English words “virus” and “window”. When translating into French, regardless of whether one is in a general domain or a computer domain, they are translated the same way: as “virus” and “fenêtre”, respectively. However, when translating into Japanese, the domain matters. In a general domain, they are respectively translated as “病原体” and “窓”; but in a computer domain they are transliterated.

To build SMT systems that are adapted to a new domain, we first consider the setting with parallel data from the new domain. A baseline translation approach we will leverage explicitly models the domain-specificity of phrase pair types to re-estimate translation probabilities. Rather than using static mixtures of old and new translation probabilities, this approach learns phrase-pair-specific mixture weights based on a combination of features reflecting the degree to which each old-domain phrase pair belongs to general language (e.g., frequencies, “centrality” of old model scores), and its similarity to the new domain (e.g., new model scores, OOV counts). By moving to a PSD translation model, we can attempt much more sophisticated adaptation, and better model the entire spectrum between general and domain-specific senses. In PSD, based on training data extracted from word-aligned parallel data, a classifier scores each phrase pair in the lexicon, using evidence from the input-language context. Although there are certainly non-lexical effects of domain shift, we will focus on the lexicon, which is the most fruitful target given our past experience.
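
To make the idea of phrase-pair-specific mixture weights concrete, the following is a minimal sketch rather than the baseline system itself. It assumes a logistic function maps phrase-pair features to the weight given to the old-domain estimate; the function names, feature names, and all numbers are hypothetical.

```python
import math


def old_domain_weight(features, weights, bias=0.0):
    """Map phrase-pair features to a weight in (0, 1) for the old-domain estimate.

    Assumption: a logistic function of a weighted feature sum. The actual
    parameterization of the baseline is not specified here.
    """
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def adapted_probability(p_old, p_new, w_old):
    """Interpolate old- and new-domain translation probabilities for one phrase pair."""
    return w_old * p_old + (1.0 - w_old) * p_new


# Hypothetical feature weights (these would be learned) and feature values.
weights = {"generality": 1.5, "new_domain_similarity": 2.0}
features = {"generality": 0.2, "new_domain_similarity": 0.1}

w_old = old_domain_weight(features, weights, bias=-1.0)
p = adapted_probability(p_old=0.10, p_new=0.60, w_old=w_old)
print(f"old-domain weight: {w_old:.3f}, adapted probability: {p:.3f}")
```

A static mixture corresponds to using the same `w_old` for every phrase pair; here the weight varies with each pair's features.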

With parallel data, our work will focus on adapting PSD to new domains in order to learn better scores for lexical selection. First, we will design adaptation algorithms for PSD, by applying existing learning techniques for DA. Such approaches typically have two goals: (1) to reduce the reliance of the learned model on aspects that are specific to the old domain (and hence will be unavailable at test time), and (2) to use correlations between related old-domain examples and new-domain examples to “port” parameters learned on the old domain to the new domain. Such techniques can be directly applied to the PSD translation model, using large context as features. Second, we will determine what features are most important for this task. We can limit ourselves to local contexts as in past work, or can use much larger contexts (the paragraph, or perhaps the entire document) to build better models. In addition, we will use morphological features to tackle the data sparsity issues that arise when dealing with small amounts of new-domain data.
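
As an illustration of how a classification-based DA technique plugs into PSD, the sketch below uses feature augmentation, one well-known technique from the DA literature. This choice is an assumption made for illustration; the proposal does not commit to a particular algorithm. The function name and the feature names are hypothetical.

```python
def augment(features, domain):
    """Return a shared copy plus a domain-specific copy of every feature.

    A linear classifier trained on the augmented features can put weight on
    the shared copy when a feature behaves the same in both domains, and on
    the domain-specific copy when it does not.
    """
    augmented = {}
    for name, value in features.items():
        augmented["shared:" + name] = value
        augmented[domain + ":" + name] = value
    return augmented


# Hypothetical context features for one occurrence of the source word "run".
context = {"left_word=to": 1.0, "right_word=the": 1.0, "doc_topic=software": 1.0}

old_example = augment(context, "old")
new_example = augment(context, "new")
print(sorted(new_example))
```

The augmented dictionaries can be fed to any linear classifier that accepts sparse features. The shared copies let old-domain evidence carry over to the new domain, while the domain-specific copies absorb behavior that holds in only one domain.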

With only comparable text, we must spot phrases with new senses, identify their translations, and learn to score them. We will attack the identification challenge using context-based language models (n-gram or topic models) to identify new usages. For example, in the computer domain, one can observe that “window” still appears on the English side, but “窓” (the general domain word for “window”) has disappeared in Japanese, indicating a potential new sense. For identifying translations we will study dictionary mining or active learning. The scoring problem can be addressed exactly as before. While finding new senses and translations is a challenging problem even in a single domain, we believe that differences that might get lost in a single domain with plentiful data will be more apparent in an adaptation setting.
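
One simple way to flag a potential new usage is to compare the contexts in which a word appears in old-domain and new-domain text. The sketch below is a deliberately small stand-in for the n-gram or topic models mentioned above: it assumes a unigram context model and uses Jensen-Shannon divergence as the comparison, and the toy sentences are invented for illustration.

```python
import math
from collections import Counter


def context_counts(sentences, target, window=2):
    """Count the words appearing within `window` tokens of each occurrence of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                start = max(0, i - window)
                counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts


def js_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between two count distributions."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both distributions need at least one observation")
    divergence = 0.0
    for word in set(counts_a) | set(counts_b):
        p = counts_a[word] / total_a
        q = counts_b[word] / total_b
        m = 0.5 * (p + q)
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence


# Invented toy data: general-domain versus computer-domain uses of "window".
old_domain = ["she opened the window to get air", "the window glass was broken"]
new_domain = ["close the window and restart the program", "the dialog window shows an error"]

old_ctx = context_counts(old_domain, "window")
new_ctx = context_counts(new_domain, "window")
print(f"context divergence for 'window': {js_divergence(old_ctx, new_ctx):.3f}")
```

Words whose context divergence is high relative to other words would be candidates for having acquired a new sense. Any threshold would need to be tuned on real data.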

Evaluation: We will create standard experimental conditions for domain adaptation in SMT and make all resources available to the community. We will consider three very different domains with which we have past experience: medical texts, movie subtitles and scientific texts. We will focus on French-English data, since our team includes native speakers of these two languages.

We will evaluate the performance of all adapted and non-adapted translation systems using standard automatic metrics of translation quality such as BLEU and Meteor. However, we strongly suspect that these generic metrics do not adequately capture the impact of adaptation on domain-specific vocabulary, and we will investigate how to evaluate domain-specific translation quality in a more directly interpretable way. We will study lexical choice accuracy (automatically checking whether a translation predicted by PSD using source context is correct) using gold standard annotations. We will evaluate extracting these annotations by manually correcting automatic word alignments and also by using terminology extraction techniques (e.g., finding translations of keywords in scientific texts).
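
Lexical choice accuracy can be computed directly once gold annotations exist. The following is a minimal sketch under the assumption that each annotated source phrase instance comes with a set of acceptable translations; the function name and the example data are hypothetical.

```python
def lexical_choice_accuracy(predictions, gold):
    """Fraction of source phrase instances whose predicted translation is acceptable.

    `predictions` holds one predicted translation per instance; `gold` holds,
    for the same instances, the set of translations judged correct.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("at least one annotated instance is required")
    correct = sum(1 for predicted, acceptable in zip(predictions, gold) if predicted in acceptable)
    return correct / len(gold)


# Invented example: three instances of English source words translated into French.
predictions = ["exécuter", "courir", "fenêtre"]
gold = [{"exécuter"}, {"exécuter", "lancer"}, {"fenêtre"}]

accuracy = lexical_choice_accuracy(predictions, gold)
print(f"lexical choice accuracy: {accuracy:.2f}")
```

Unlike BLEU, this score can be restricted to domain-specific vocabulary by computing it only over instances of the terms of interest.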

Organization: Before the workshop, we will collect and process all necessary data, train language models, topic models and baseline SMT and PSD systems. During the workshop, we will focus exclusively on data analysis, design and evaluation of new algorithms.

Conclusion: Domain mismatch is a significant challenge for statistical machine translation. Our proposed work will elucidate this problem through careful data analysis, will provide test beds for future research, will close the gap between statistical domain adaptation and statistical machine translation, and will improve translation quality through novel methods for identifying new senses from comparable corpora.


Team Members
Senior Members
Marine Carpuat, National Research Council Canada
Hal Daumé III, University of Maryland
Alexander Fraser, University of Stuttgart
Chris Quirk, Microsoft Research
Graduate Students
Fabienne Braune, University of Stuttgart
Ann Clifton, Simon Fraser University
Ann Irvine, Johns Hopkins University
Jagadeesh Jagarlamudi, University of Maryland
John Morgan, Army Research Laboratory
Majid Razmara, Simon Fraser University
Ales Tamchyna, Charles University
Undergraduate Students
Katharine Henry, University of Chicago
Rachel Rudinger, Yale University
Affiliate Members
George Foster, National Research Council Canada

Center for Language and Speech Processing