========================================================================
The EGYPT toolkit, Version 1.0.
Developed by the Statistical Machine Translation team, WS'99, CLSP/JHU.
http://www.clsp.jhu.edu/ws99/projects/mt
========================================================================

RELEASE HISTORY:

Dec 10, 1999: Limited beta release.
Feb 23, 2000: Official release version 1.0.

CONTENTS:

The EGYPT directory is organized into the following directory trees:

1. The "tools" directory tree: each generic (not language-specific)
   tool that we developed is stored in its own subdirectory under the
   tools directory. This release includes the following tools:

   whittle/   a tool for preparing and splitting corpora and generating
              files in *.snt format (the format required by GIZA).
   GIZA/      the statistical model training program (developed in C++).
   cairo/     the word alignment visualization tool (developed in Java).
   cairoize/  scripts for generating alignment files in *.aln format
              (the format required by cairo) (developed in C and perl).
   misc/      some useful scripts for collecting corpus statistics.

   NOTE: The stack decoder that we used during the workshop is not
   distributed with this release. It still needs more work and will
   hopefully be released in the near future (keep visiting the web
   page).

2. bin: this directory contains symbolic links to the binaries of each
   of the tools.

3. doc: this directory contains symbolic links to the tools' README
   files.

4. There is one directory tree per language. Each language directory
   has the following subdirectories:

   corpora/   to keep the corpora for that language.
   tools/     to keep tools specific to that language (e.g. tokenizers).
   dict/      to keep dictionaries (if any).
   work/      we suggest that you use this subdirectory to keep
              intermediate files generated by the software tools.

   Currently, the following language subdirectories are included:

   arabic/    The Quran (Islam's holy book) in Arabic with an English
              translation. It was aligned based on chapter and verse
              numbers. Verses are typically equivalent to sentences.
              We have not done any experiments with this corpus.
   english/   Contains an English tokenizer and some monolingual text
              that was downloaded from the UN web site. It was used
              for our Timorese (Tetun) experiments.
   french/    The Linguistic Data Consortium (LDC) has a large,
              sentence-aligned corpus (http://www.ldc.upenn.edu/).
              Also, we are currently working on aligning a large,
              unencumbered Hansard parallel corpus that we plan to
              release in the future. However, this may take a while.
              If you can't wait, then you can download the corpus
              from http://
   tetun/     A small Tetun-English parallel corpus. It was
              sentence-aligned manually.
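
The per-language layout described in item 4 can be reproduced for a new language with a few lines of scripting. The following is only a sketch and is not part of the EGYPT release: the function name make_language_tree and the example language name are hypothetical, and the only thing taken from this README is the list of four subdirectory names.

```python
from pathlib import Path

# Subdirectories this README lists for every language directory.
LANGUAGE_SUBDIRS = ("corpora", "tools", "dict", "work")


def make_language_tree(egypt_root, language):
    """Create <egypt_root>/<language>/{corpora,tools,dict,work}.

    Hypothetical helper, not shipped with EGYPT. Returns the list of
    created (or already existing) subdirectory paths.
    """
    created = []
    for name in LANGUAGE_SUBDIRS:
        path = Path(egypt_root) / language / name
        # parents=True builds the language directory if it is missing;
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

For example, calling make_language_tree("EGYPT", "spanish") from the directory that contains your EGYPT tree would create EGYPT/spanish/ with the four subdirectories; both argument values here are placeholders.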