Greetings, here's a quick summary of the MT planning meeting we had at Johns Hopkins University almost two weeks ago.  Sorry for being late, but I was out of town all last week!  (Thanks Yaser, for taking notes...) 

I have a more detailed list of software tasks that we need to do to prepare for our experiments; I'll send these in a separate message.

My impression from the meeting is that "distortion" may turn out to be a good focus for research in the workshop.  We already have some pretty clear ideas about things to do with the Collins parser for Czech (e.g., building a "Czech-prime" corpus) and English (e.g., re-scoring decoder output).  Dan and Mike also suggested using relative-position-movement parameters rather than Model 3's absolute-position-movement parameters -- it's probably a good idea to implement these things, perhaps initially based on the literature, and see if they actually reduce cross-entropy.  We can also mine existing Model 3 word alignments for patterns and ideas... and/or ultimately design a superior generative model for word-order shift.

        Kevin


Meeting summary.  In the morning, we did an overview of statistical MT, and looked at the pre-workshop software infrastructure (1-5 below).  In the afternoon, we discussed goals for the workshop itself (6-8 below).

1. Baseline Training

The Model 3 training algorithm is ready and uses integer log probabilities. The tables are kept entirely in memory, in hash tables. It was suggested (mostly by Dan) that we instead dump counts to a file, sort them externally, and then read them back into memory, normalizing as they are read.
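To make the dump/sort/normalize suggestion concrete, here is a minimal sketch in Python. It is not the actual training code: the (e, f, count) record layout, the SCALE constant, and the function names are all assumptions made for illustration, and an in-memory sorted() stands in for the external sort.

```python
import math
from itertools import groupby

SCALE = 1000  # hypothetical scaling factor for the integer encoding


def to_int_logprob(p, scale=SCALE):
    """Encode probability p as a scaled, rounded negative natural log."""
    return int(round(-math.log(p) * scale))


def normalize_sorted_counts(records):
    """Turn sorted (e, f, count) records into (e, f, integer log prob).

    Because the records are sorted by e, each group can be normalized
    on its own, so only one group has to be in memory at a time.
    """
    for e, group in groupby(records, key=lambda r: r[0]):
        rows = list(group)
        total = sum(count for _, _, count in rows)
        for _, f, count in rows:
            yield e, f, to_int_logprob(count / total)


# Toy fractional counts; sorted() stands in for the external sort.
counts = [("e1", "f2", 3.0), ("e2", "f3", 1.0), ("e1", "f1", 1.0)]
table = list(normalize_sorted_counts(sorted(counts)))
```

With this encoding, a smaller integer means a more probable entry, and multiplying probabilities becomes integer addition.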

We also discussed crude parallelization of the code. The suggestion is to fork multiple processes that communicate tables through files; these files will be merged at the end of each iteration and then normalized to produce the revised tables. The format of these tables should be specified prior to the workshop (most likely in the two-week intensive preparation period).
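A rough sketch of the merge step, in Python. The tab-separated file format, the file names, and the function names are assumptions (the real format is still to be specified). For simplicity the two "workers" here are just dictionaries written out in the same process; no actual forking is shown.

```python
import os
import tempfile
from collections import defaultdict


def write_counts(path, counts):
    """Write one worker's partial counts as 'e<TAB>f<TAB>count' lines."""
    with open(path, "w", encoding="utf-8") as fh:
        for (e, f), c in sorted(counts.items()):
            fh.write(f"{e}\t{f}\t{c!r}\n")


def merge_count_files(paths):
    """Sum the partial counts from all worker files."""
    merged = defaultdict(float)
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                e, f, c = line.rstrip("\n").split("\t")
                merged[(e, f)] += float(c)
    return dict(merged)


def normalize(merged):
    """Divide each count by the total for its conditioning word e."""
    totals = defaultdict(float)
    for (e, _), c in merged.items():
        totals[e] += c
    return {(e, f): c / totals[e] for (e, f), c in merged.items()}


def demo():
    # Partial counts as two hypothetical workers might produce them.
    shard_a = {("e1", "f1"): 1.5, ("e1", "f2"): 0.5}
    shard_b = {("e1", "f1"): 0.5, ("e2", "f3"): 2.0}
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i, shard in enumerate([shard_a, shard_b]):
            path = os.path.join(tmp, f"counts.{i}.tsv")
            write_counts(path, shard)
            paths.append(path)
        merged = merge_count_files(paths)
    return normalize(merged)


probs = demo()
```

Since the counts are additive, the order in which the files are merged does not matter, which is what makes this kind of crude parallelization workable.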

2. Czech Resources

- Czech/English parallel corpus (67K sentence pairs = 2M words) is now available (and has since been sent to ISI).
- Czech/English dictionary is available (92% coverage of Czech part of the corpus).
- Collins parser for Czech will be available at the workshop.
- POS tagger and lemmatization program for Czech will be available at the workshop.
- Czech monolingual text (20M words) is available
- Jan suggested acquiring commercial Czech/English translation software (for under $500).

3. Decoder

Not finished yet but hopefully it will be ready before the workshop.

4. Other tools and resources

- Alignment tool (from Dan).
- The Hong Kong parliament corpus will not be available (David).
- A hand-tagged French corpus from the University of Montreal will be available for our use in the workshop (Dan).
- French lemmatization software will be available (David).
- French POS tagger may be available (David will talk to Michel).

5. Other Issues

- David suggested incremental documentation and web-publishing as we progress.
- Dissemination and rights were also discussed.  We decided we should make a public-domain minicorpus available with the distribution (Dan suggested using bilingual text from the Canadian government web page).

6. What things would be identified as a successful result for the workshop?

- Any Czech MT system.
- "MT in a day" for a new language pair
- Improvements over baseline Model3 training (i.e., better quality)
- Ways to deal with small bilingual corpora (same quality as baseline with less data)   
- Make SMT toolkit for distribution
- Beat commercial MT
- Improve on translation performance on non-domain text (i.e., broaden scope of SMT)

Most of these things can be measured objectively if we plot cross-entropy vs. MT accuracy vs. training-set size, etc., in a number of controlled experiments.  Generating these plots should be a good contribution in itself, besides providing a baseline for experiments.
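For reference, one common way to compute the cross-entropy figure for such plots is the negative total log2 probability of a held-out set divided by its word count. A minimal sketch in Python follows; the function names and the toy numbers are assumptions, and the model's per-sentence log probabilities are taken as given.

```python
def cross_entropy_bits_per_word(sentence_log2_probs, word_counts):
    """Negative total log2 probability divided by total number of words."""
    total_words = sum(word_counts)
    if total_words == 0:
        raise ValueError("held-out set contains no words")
    return -sum(sentence_log2_probs) / total_words


def perplexity(cross_entropy_bits):
    """Perplexity corresponding to a cross-entropy measured in bits."""
    return 2.0 ** cross_entropy_bits


# Toy held-out set: two sentences of 5 and 3 words.
h = cross_entropy_bits_per_word([-20.0, -12.0], [5, 3])
ppl = perplexity(h)
```

Computing this on the same held-out set for models trained on increasing amounts of data gives one point per training-set size.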

7.  Ways to achieve some of those results

David suggested a possible improvement over Model 3's distortion parameters.  With the help of last workshop's Czech parser, we may be able to write/learn rules for transforming a Czech sentence into a "Czech prime" sentence, which is Czech with English word order.  We would then use Model 3's distortion parameters to learn to map Czech prime to English, and hopefully this mapping would be more "regular" than Czech to English.
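As a toy illustration of what a "priming" rule could look like, here is a sketch in Python that permutes the children of parse-tree nodes and reads off the new word order. The tree encoding, the labels (S, NP, NOUN, ADJ, VERB), and the single rule are all hypothetical; they are not the output format of the Czech parser and make no claim about Czech grammar.

```python
def reorder(node, rules):
    """Recursively apply child-permutation rules to a parse tree.

    A node is (label, word) for a leaf or (label, [children]) otherwise.
    rules maps (label, child-label tuple) to the new order of children.
    """
    label, body = node
    if isinstance(body, str):
        return node
    children = [reorder(child, rules) for child in body]
    key = (label, tuple(child[0] for child in children))
    order = rules.get(key)
    if order is not None:
        children = [children[i] for i in order]
    return (label, children)


def leaves(node):
    """Read the words off the tree from left to right."""
    _, body = node
    if isinstance(body, str):
        return [body]
    return [word for child in body for word in leaves(child)]


# Hypothetical rule: inside an NP, move the ADJ in front of the NOUN.
RULES = {("NP", ("NOUN", "ADJ")): (1, 0)}

tree = ("S", [("NP", [("NOUN", "w1"), ("ADJ", "w2")]), ("VERB", "w3")])
prime = leaves(reorder(tree, RULES))
```

Here the original order w1 w2 w3 becomes w2 w1 w3. Whether rules like this are written by hand or learned from Model 3 alignments is left open.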

8.  Things that need to be done to support experiments

- Get Model3 Training running, installed at JHU, parallelized (1-3 days)
- Set up corpora at JHU (before workshop)
- Get decoder ready and running (before start of workshop)
- Get other tools ported and installed
- Port or implement evaluation software
- Visualization software (1 week)
- Evaluations:
        - automated and by hand.
        - cross-entropy vs translation quality (depends on (5)).
        - cross-entropy vs data size.
        - cross-entropy evaluation (depends on (1)).
- Morphology:
        Czech: (1 day)
        - install Czech tagger / analyzer.
        - tag bilingual corpus
        French:
        - building unsupervised / supervised morphological analyzer
        - install Michel's tagged corpus.
        - obtain max-entropy tagger, Brill tagger.
        - training tagger.
        English:
        - morphological analysis and tagging for English.
- Word Order ("distortion")
        Czech:
        - install Czech parser and parse Czech corpus (1 day after tagging and lemmatization)
        - create priming rules and then create Czech prime corpus.
        English:
        - re-ranking technique for n-best lists generated by the decoder.
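
A minimal sketch of the n-best re-ranking idea, in Python. It assumes the decoder returns (hypothesis, log score) pairs and that a second scorer (for example, a log probability from the English parser) is available as a function. The names, the weight, and the toy scores are all assumptions for illustration.

```python
def rerank(nbest, extra_score, weight=1.0):
    """Sort hypotheses by decoder score plus weighted extra score, best first.

    nbest is a list of (hypothesis, decoder_log_score) pairs.
    extra_score maps a hypothesis to an additional log score.
    """
    return sorted(
        nbest,
        key=lambda item: item[1] + weight * extra_score(item[0]),
        reverse=True,
    )


# Toy example: the decoder slightly prefers "hyp a", the second scorer
# prefers "hyp b" by a larger margin.
nbest = [("hyp a", -10.0), ("hyp b", -10.5)]
toy_parser_scores = {"hyp a": -8.0, "hyp b": -6.0}

reranked = rerank(nbest, toy_parser_scores.get)
best = reranked[0][0]
```

With weight=0.0 the decoder's own ranking is kept, so the weight can be tuned to see how much the second scorer actually helps.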