Machine Translation Lab Session

JHU Summer School on Human Language Technology

Wednesday July 9: Syntax for Statistical Machine Translation (WS03)
 Lab Session

Anoop Sarkar <anoop@cs.sfu.ca>
Simon Fraser University

These exercises should be run on the Linux machine assigned to your group.
Copy the directory /export/ws03_mt/lab to your home directory:
cp -r /export/ws03_mt/lab .

Training EGYPT/GIZA++

In this part of the lab we will train a statistical machine translation system. We will use the EGYPT toolkit together with GIZA++, an implementation of the IBM models for MT (with extensions).

The EGYPT toolkit was developed by the Statistical Machine Translation team during the 1999 summer workshop at the Center for Language and Speech Processing at Johns Hopkins University (CLSP/JHU).

GIZA++ is an extension of the program GIZA (which was part of the SMT toolkit EGYPT). The extensions of GIZA++ were designed and written by Franz Josef Och.

All the programs run on Linux. You will need to log in to one of the machines d01..d16 or e01..e16 to run these experiments.
  1. Modify your PATH environment variable:
  2. Now create the training data (a parallel corpus of aligned source e and foreign f text) for training the MT system. The particular translation task we will consider is translating from the archaic form of English used in some versions of the Bible to a more modern form of English:
    f: Abraham begat Isaac ; and Isaac begat Jacob ; and Jacob begat Judas and his brethren
    e: Abraham was the father of Isaac , and Isaac the father of Jacob , and Jacob the father of Judah and his brothers
    Other f languages available in the ~/mtrun/corpora directory are French, Spanish, Swedish and Tetun (East Timorese). As shown above, this lab uses two forms of English as the source text and the foreign text. This is useful because you can read both sides of the corpus, which lets you understand the kinds of decisions and the kinds of errors made by MT systems. Since training is computationally expensive, we will train on a subset of the training data (note that system performance will be better with more training data). First, create the training data for GIZA++:
    cd lab/giza-lab/english
    head -3000 bible.train.english > run/bible.e
    head -3000 bible.train.foreign > run/bible.f
  3. To train GIZA++ on the data created above, run the training code. The trainer takes three arguments: the source text, the "foreign" text, and the directory where the output translation model is written. The trainer first runs a tool called whittle, which creates the training data for GIZA++. It then runs the iterative EM training procedure described in the tutorial (this will take quite some time to finish, depending on your machine specs).
    cd run
    runGIZA++.pl bible.e bible.f linux     
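
To get a feel for what the iterative EM procedure does, here is a minimal sketch of EM for IBM Model 1 only. It is a toy illustration, not the GIZA++ implementation: it omits the NULL word and the higher IBM models, and the three sentence pairs and the names train_model1 and TOY_PAIRS are made up for this example.

```python
from collections import defaultdict

# Toy parallel corpus of (source e, foreign f) pairs, invented for illustration.
TOY_PAIRS = [
    ("his brothers".split(), "his brethren".split()),
    ("his house".split(), "his house".split()),
    ("brothers".split(), "brethren".split()),
]

def train_model1(pairs, iterations=10):
    """Estimate t(f | e) with EM for IBM Model 1 (no NULL word)."""
    f_vocab = {f for _, fs in pairs for f in fs}
    # Start from a uniform translation table.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned at all
        for es, fs in pairs:
            for f in fs:
                # E-step: share one unit of count for f among the e words
                # of this sentence, in proportion to the current t(f | e).
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

if __name__ == "__main__":
    table = train_model1(TOY_PAIRS)
    for (f, e), p in sorted(table.items()):
        print(f"t({f} | {e}) = {p:.3f}")
```

Each iteration sharpens the table: the pair that occurs alone (brothers / brethren) pulls probability mass away from the competing link between "his" and "brethren".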

Viewing word alignments with Cairo

  1. cd ~/lab/giza-lab/english/run

  2. In this part of the lab, we will visualize the alignments induced by GIZA++ using Cairo (also part of the EGYPT system, written by Mike Jahr and Noah Smith). First we need to "cairoize" the statistical MT model (notice the "." after the command):
    sh do_cairoize .
  3. Then run the Java based word alignment viewer, Cairo:
    /home/ws02/anoop/tools/cairo/run_cairo
  4. You will have to browse (using File|Open) to the location where the alignments were generated (in ~/lab/giza-lab/english/run). The alignments are stored in a file whose name has the form username.date.pid. Open the file that corresponds to your username.
  5. Find at least 5 distinct kinds of alignments that you think are incorrect. Explain for each one why you think it is an incorrect alignment.
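
When judging alignments, it helps to remember that in the IBM models each foreign word is linked to exactly one source word. The sketch below shows how a best alignment would be read off a translation table under IBM Model 1 only. The probabilities in TOY_T are invented for this illustration and do not come from a trained model; best_alignment is a hypothetical helper name.

```python
# Invented values of t(f | e), keyed by (foreign word, source word).
TOY_T = {
    ("Jacob", "Jacob"): 0.9,
    ("begat", "father"): 0.6,
    ("begat", "the"): 0.2,
    ("begat", "of"): 0.2,
    ("Judas", "Judah"): 0.8,
}

def best_alignment(e_words, f_words, t, floor=1e-9):
    """For each foreign position j, pick the source position i maximising t(f_j | e_i)."""
    alignment = []
    for j, f in enumerate(f_words):
        scores = [t.get((f, e), floor) for e in e_words]
        i = max(range(len(e_words)), key=scores.__getitem__)
        alignment.append((j, i))
    return alignment

if __name__ == "__main__":
    e = "Jacob the father of Judah".split()
    f = "Jacob begat Judas".split()
    for j, i in best_alignment(e, f, TOY_T):
        print(f"{f[j]} -> {e[i]}")
```

In this toy case "begat" can link to only one of "the", "father" and "of", so the other two source words are left without a link. Patterns like this are worth looking for in the Cairo display.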

Creating a Language Model with the CMU-Cambridge LM Toolkit

  1. Now that we have a trained translation model, we can try to decode sentences from the foreign text into the source English text. Before we can run a decoder, we need to build a language model for the source English text. We will use the CMU-Cambridge Language Modeling Toolkit to construct an LM.
  2. First, we create a vocabulary list which will be used as input to the LM toolkit.
    cd ~/lab/giza-lab/english/run/lm
    perl voc.pl < ../../bible.train.english > english.bible.voc
  3. Now we create a language model (an n-gram model of the text) using the LM toolkit:
    perl ~anoop/tools/bin/runCMUToolkit.pl ../../bible.train.english -d . -v english.bible.voc
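
As a rough picture of what an n-gram model of the text is, here is a minimal bigram model with add-one smoothing. This is a sketch for intuition only: it is not what the LM toolkit computes (a real toolkit uses more refined discounting and backoff), and the two training sentences and the function names are assumptions made for this example.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Collect context and bigram counts, with <s> and </s> sentence markers."""
    contexts, bigrams, successors = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
        successors.update(tokens[1:])            # every token that can follow another
    return contexts, bigrams, len(successors)

def logprob(sentence, model):
    """Natural-log probability of a sentence under add-one smoothing."""
    contexts, bigrams, v = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
        for a, b in zip(tokens, tokens[1:])
    )

if __name__ == "__main__":
    lm = train_bigram_lm([
        "Abraham was the father of Isaac",
        "Isaac was the father of Jacob",
    ])
    print(logprob("Jacob was the father of Isaac", lm))
    print(logprob("father the of was Isaac Jacob", lm))
```

The fluent word order receives the higher score even though both test sentences contain the same words, which is the property the decoder relies on.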

Translating with the ISI ReWrite Decoder

  1. cd ~/lab/giza-lab/english/run/decode
  2. Edit the config file isi-decoder.config. Modify the top two lines of the config file as follows (note that you have to supply the full path to your home directory; you cannot use ~ instead):
    LanguageModelFile = YOUR_HOME_DIR/lab/giza-lab/english/run/lm/bible.train.english.binlm
    TranslationModelConfigFile = YOUR_HOME_DIR/lab/giza-lab/english/run/linux/tmconfig.cfg
  3. Now we can translate from archaic English to modern English. Note that the input to the decoder is the "foreign" text (archaic English, as in the f example above) and the output is the source text (modern English). To run the decoder and save the results in an output file, run:
    head -100 ../../bible.eval.foreign | isi-decoder.linux --config isi-decoder.config > english.out
  4. Examine the input to the translation and the output. Compare the entire translation process that you have just run to the noisy channel model (for a graphic of the model, use acroread ~/lab/noisychannel.pdf).
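
For the comparison, recall that the noisy channel model picks the source sentence e that maximises P(e) * P(f | e): the language model supplies P(e) and the translation model supplies P(f | e). The sketch below applies that rule to a fixed list of candidates. All log-probabilities are invented for illustration, and a real decoder searches a very large space of candidates instead of enumerating a given list.

```python
# Hypothetical language model scores log P(e), invented for this example.
LM_LOGPROB = {
    "Isaac was the father of Jacob": -6.0,
    "Isaac father Jacob the of was": -15.0,
}

# Hypothetical translation model scores log P(f | e), keyed by (f, e).
# The two candidates contain the same words, so a word-based translation
# model may score them equally; the language model breaks the tie.
TM_LOGPROB = {
    ("Isaac begat Jacob", "Isaac was the father of Jacob"): -4.0,
    ("Isaac begat Jacob", "Isaac father Jacob the of was"): -4.0,
}

def decode(f, candidates):
    """Return the candidate e with the highest log P(e) + log P(f | e)."""
    return max(candidates, key=lambda e: LM_LOGPROB[e] + TM_LOGPROB[(f, e)])

if __name__ == "__main__":
    print(decode("Isaac begat Jacob", list(LM_LOGPROB)))
```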

Evaluating Translations with the BLEU metric

  1. cd ~/lab/giza-lab/english/run/eval
  2. Score the baseline (the untranslated input, scored as if it were the translation):
    perl mteval-v09c.pl -r ref.sgm -s src.sgm -t src1.sgm -b
  3. Score your translations:
    perl trg2sgml.pl ../decode/english.out > trg.sgm
    perl mteval-v09c.pl -r ref.sgm -s src.sgm -t trg.sgm -b
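
To make the score easier to interpret, here is a minimal sketch of the BLEU computation: the geometric mean of clipped n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. It assumes whitespace tokenization and a single reference per segment, so its numbers will not necessarily match those of the evaluation script, and the function names are chosen for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with one reference per candidate segment."""
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        c, r = cand.split(), ref.split()
        cand_len += len(c)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            c_counts, r_counts = ngrams(c, n), ngrams(r, n)
            # Clip each candidate n-gram count by its count in the reference.
            matches[n - 1] += sum(min(k, r_counts[g]) for g, k in c_counts.items())
            totals[n - 1] += sum(c_counts.values())
    if cand_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalise candidates shorter than the reference.
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return bp * math.exp(log_precision)

if __name__ == "__main__":
    ref = ["Abraham was the father of Isaac"]
    print(bleu(["Abraham was the father of Jacob"], ref))
    print(bleu(["Abraham begat Isaac"], ref))
```

The second call illustrates why an untranslated baseline can score very low: without any matching 4-gram, this unsmoothed version of the score is zero.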

 Improving Translation Quality

  1. Now your task is to improve translation quality. You will need to create a new training set, retrain the GIZA++ models and the language model, and decode again. Here are some ideas to start from; you are welcome to try your own ideas.

Word Segmentation and Machine Translation

In some languages, the written or textual script does not have whitespace characters between the words. The task of taking an input sentence and inserting legitimate word boundaries is called word segmentation. We will use finite state transducers (FSTs) for word segmentation in Chinese. The lexicon we will use to construct Chinese sentences is given in the table below. You are given a finite state machine (use acroread ~/lab/fsm1.pdf to view the FSM) which is an extremely simple grammar for sentences in Chinese using this lexicon. 

Chinese Word (pinyin)    English Translation
da4                      big
da4jie1                  avenue
wo3                      I
fang4                    place
jie1                     avenue
jie3fang4                liberation
bu4                      not
liao3jie3                understand
wang4                    forget
wang4bu4liao3            unable to forget
fang4da4                 enlarge
na3li3                   where
zai4                     at
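
To see why segmentation is ambiguous with this lexicon, the sketch below enumerates every way of splitting an unsegmented pinyin string into lexicon words. It uses plain recursion in Python, not a finite state transducer or the FSM toolkit, and the input string is an example chosen here for illustration and is not taken from the lab materials.

```python
# The lexicon from the table above (pinyin word -> English gloss).
LEXICON = {
    "da4": "big", "da4jie1": "avenue", "wo3": "I", "fang4": "place",
    "jie1": "avenue", "jie3fang4": "liberation", "bu4": "not",
    "liao3jie3": "understand", "wang4": "forget",
    "wang4bu4liao3": "unable to forget", "fang4da4": "enlarge",
    "na3li3": "where", "zai4": "at",
}

def segmentations(text, lexicon=LEXICON):
    """Return every way of splitting text into a sequence of lexicon words."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for word in lexicon:
        if text.startswith(word):
            # Try this word first, then segment whatever remains.
            for rest in segmentations(text[len(word):], lexicon):
                results.append([word] + rest)
    return results

if __name__ == "__main__":
    for seg in segmentations("wo3wang4bu4liao3jie3fang4da4jie1"):
        print(" ".join(seg), "|", " / ".join(LEXICON[w] for w in seg))
```

Several segmentations are possible for this one string, so a segmenter needs some way of preferring one over the others, for example weights on the words or on the paths of the transducer.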

Automating Word Segmentation using the AT&T FSM Toolkit
