This toolkit was developed by the Statistical
Machine Translation team during the summer
workshop in 1999 at the Center for Language and Speech Processing at
Johns Hopkins University (CLSP/JHU).
The current release is version 1.0 and it includes the following:
1. Software Tools:
| Tool | Description |
| --- | --- |
| Whittle | A tool for preparing bilingual corpora and splitting them into training and testing sets. Whittle generates files in *.snt format (the input corpus format required by GIZA). Whittle is written in Perl. |
| GIZA | Training program that learns statistical translation models from bilingual corpora. GIZA is written in C++ with the STL (tested using GNU C++). |
| cairo | Word alignment visualization tool. Cairo is written in Java. |
| cairoize | A tool for generating alignment files in *.aln format (the format required by cairo). Cairoize is written in C and Perl. |
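
As an illustration of the kind of training/testing split that Whittle performs, here is a minimal sketch in Python. It is not Whittle's actual implementation (Whittle is a Perl tool) and it does not produce *.snt files; the function name `split_parallel_corpus`, the `test_fraction` and `seed` parameters, and the random-shuffle strategy are all assumptions made for this example.

```python
import random


def split_parallel_corpus(source_sentences, target_sentences,
                          test_fraction=0.1, seed=0):
    """Split a sentence-aligned corpus into (train, test) lists of pairs.

    Hypothetical helper for illustration only; it is not part of EGYPT.
    Each returned item is a (source_sentence, target_sentence) tuple, so
    the two sides of the corpus stay aligned after shuffling.
    """
    if len(source_sentences) != len(target_sentences):
        raise ValueError("both sides must have the same number of sentences")
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1")

    # Pair the sentences first, then shuffle the pairs as units.
    pairs = list(zip(source_sentences, target_sentences))
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    rng.shuffle(pairs)

    n_test = int(round(len(pairs) * test_fraction))
    test = pairs[:n_test]
    train = pairs[n_test:]
    return train, test


if __name__ == "__main__":
    # Placeholder sentences, not taken from any distributed corpus.
    src = ["source sentence %d" % i for i in range(10)]
    tgt = ["target sentence %d" % i for i in range(10)]
    train, test = split_parallel_corpus(src, tgt, test_fraction=0.2)
    print(len(train), "training pairs;", len(test), "test pairs")
```

Shuffling the pairs rather than each side separately is what keeps the sentence alignment intact, which is the property any such split has to preserve.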
NOTE:
The stack decoder that was developed for and used during the workshop is not yet distributed with this release. It still needs more work, and we hope to release it in the near future.
2. Corpora:
| Language | Description |
| --- | --- |
| East Timorese (Tetun) | A small Tetun-English parallel corpus. It was sentence-aligned manually. |
| Arabic | The Quran (Islam's holy book) in Arabic with English translation. It was aligned semi-automatically based on chapter and verse numbers. Verses are typically equivalent to sentences. We have not done any experiments with this corpus. |
| English | Monolingual text that was downloaded from the UN web site. It was used for our East Timorese (Tetun) experiments. |
| French | Currently no French corpus is distributed. However, the Linguistic Data Consortium (LDC) has a large, sentence-aligned corpus. We are also working on aligning a large, unencumbered Hansard parallel corpus that we plan to release in the future, but this may take a while. If you can't wait, you can download the corpus from the Canadian Parliament web site. If you download and sentence-align it and would like to include it in the EGYPT distribution, please see the note below. |
NOTE:
Additional corpora will be added when they become available. Also, if you
would like to donate a parallel corpus to the research community, please
send e-mail to yaser@isi.edu.
Download: