The EGYPT Statistical Machine Translation Toolkit


This toolkit was developed by the Statistical Machine Translation team during the 1999 summer workshop at the Center for Language and Speech Processing at Johns Hopkins University (CLSP/JHU).

The current release is version 1.0 and it includes the following:

1. Software Tools:
 

Tool      Description
Whittle   A tool for preparing bilingual corpora and splitting them into training and testing sets. Whittle generates files in the *.snt format (the input corpus format required by GIZA). Whittle is written in Perl.
GIZA      A training program that learns statistical translation models from bilingual corpora. GIZA is written in C++ using the STL (tested with GNU C++).
cairo     A word alignment visualization tool. Cairo is written in Java.
cairoize  A tool for generating alignment files in the *.aln format (the format required by cairo). Cairoize is written in C and Perl.
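
As an illustration of the kind of train/test split that Whittle performs, here is a minimal Python sketch. It is not Whittle's code and does not produce the *.snt format; the function name split_parallel_corpus, the test_fraction and seed parameters, and the random-shuffle strategy are all assumptions made for the example.

```python
import random


def split_parallel_corpus(source_sentences, target_sentences,
                          test_fraction=0.1, seed=0):
    """Split a sentence-aligned corpus into (train, test) lists of pairs.

    Hypothetical helper for illustration only; Whittle's actual
    splitting strategy and options may differ.
    """
    if len(source_sentences) != len(target_sentences):
        raise ValueError("source and target must have the same number of sentences")

    # Pair the sentences first so that shuffling never breaks the alignment.
    pairs = list(zip(source_sentences, target_sentences))

    # A seeded generator makes the split reproducible.
    rng = random.Random(seed)
    rng.shuffle(pairs)

    test_size = int(len(pairs) * test_fraction)
    return pairs[test_size:], pairs[:test_size]
```

Shuffling pairs rather than the two sides separately is the important design point: every source sentence stays attached to its translation in both the training and the testing set.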

NOTE:

The stack decoder that was developed for and used during the workshop is not yet distributed with this release. It still needs more work, and we hope to release it in the near future.

2. Corpora:
 
 

Corpus                 Description
East Timorese (Tetun)  A small Tetun-English parallel corpus. It was sentence-aligned manually.
Arabic                 The Quran (Islam's holy book) in Arabic with an English translation. It was aligned semi-automatically based on chapter and verse numbers. Verses are typically equivalent to sentences. We have not run any experiments with this corpus.
English                Monolingual text downloaded from the UN web site. It was used for our East Timorese (Tetun) experiments.
French                 Currently no French corpus is distributed. However, the Linguistic Data Consortium (LDC) has a large, sentence-aligned corpus. We are also working on aligning a large, unencumbered Hansard parallel corpus that we plan to release in the future, but this may take a while. If you can't wait, you can download the corpus from the Canadian Parliament web site. If you download it, sentence-align it, and would like to include it in the EGYPT distribution, please see the note below.
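
Aligning by chapter and verse number, as described for the Arabic corpus, can be sketched as follows. This is an assumed reconstruction, not the procedure actually used for the release: the function name align_by_verse, the dictionary layout keyed by (chapter, verse), and the idea of returning unmatched keys for manual review are all hypothetical.

```python
def align_by_verse(source_verses, target_verses):
    """Pair verses that share the same (chapter, verse) key.

    Both arguments are dicts mapping (chapter, verse) tuples to text.
    Returns (aligned, unmatched): aligned is a list of
    (key, source_text, target_text) tuples in chapter/verse order, and
    unmatched lists keys found on only one side, which would need to be
    checked by hand (the "semi" in semi-automatic, as assumed here).
    """
    shared = sorted(source_verses.keys() & target_verses.keys())
    unmatched = sorted(source_verses.keys() ^ target_verses.keys())
    aligned = [(key, source_verses[key], target_verses[key]) for key in shared]
    return aligned, unmatched
```

Because verses are typically equivalent to sentences, the aligned list can be treated as a sentence-aligned corpus once the unmatched keys have been resolved.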

NOTE:

Additional corpora will be added as they become available. Also, if you would like to donate a parallel corpus to the research community, please send e-mail to yaser@isi.edu.
 

Download: