Greetings, here's a summary of our first MT planning meeting. One thing we decided, of course, is when to meet next. Barring some disaster, May 21st seems to work, at Johns Hopkins. Janet Lambert of JHU should be in touch with us individually about accommodations.
There's a to-do list at the bottom of this message!
* * *
David provided some very useful do's and don'ts from past workshop experience. These included identifying bottlenecks early on, not running up computer usage in the last week, respecting other groups' computers, cross-fertilizing with other groups, and having fun on nights and weekends, the latter of which he claimed tends not to happen of its own accord.
Dan provided good general technical advice. He favored building better models over tracking down new resources, since what has been mined from bilingual texts so far is a small fraction of what's in there. Dan also suggested designing plausible models before worrying about parameter estimation algorithms. He recommended building linguistic structure into the generative process. He also recommended symmetric models over directional ones, as the former can be converted into the latter, but not vice versa. Finally, rich linguistic structure not only restricts the space of possible answers, but also provides useful classes for generalizing (backing off).
* * *
I gave a high-level slide review of "Model 3" statistical MT, which will serve as the baseline translation system for the workshop. I want to do a lower-level (yet comprehensible :-) version for anyone who is interested. It's pretty important for everybody to grasp the baseline concept in sufficient detail.
Yaser and John described the current state of the software to be used in the workshop. Yaser described an implemented corpus preprocessing tool that tokenizes English and French and replaces words by integers; the current corpus is 1.6M sentence pairs of length < 30. He described the current bilingual-text analyzer that implements "Model 3" of (Brown, 1993). It's operational, but has not been tested much yet -- much-needed cycles are available at JHU, according to David, as long as we don't complain about machines being unplugged and disks moving around. To be useful in the workshop, the current system will need to consume less memory (currently at 450M for only 60K sentence pairs and vocabulary of 10K). Dan suggested using streams instead of main memory for collecting counts. It will also have to run faster (currently Model 2 takes 40 minutes for 60K sentence pairs). Early access to the workshop facilities will make it clear how we should go about parallelizing the code. Yaser described an alignment drawing tool we obtained from Yeyi Wang of Microsoft. John suggested some sort of data visualization beyond alignments, including actual relevant parameter values.
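For anyone who hasn't seen the words-to-integers step before, here is a minimal sketch of the idea. It is not Yaser's tool: the function names are made up for illustration, whitespace splitting stands in for real English/French tokenization, and reserving id 0 (say, for an empty word) is just an assumption.

```python
def build_vocab(sentences):
    """Give each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():  # stand-in for a real tokenizer
            # ids start at 1; id 0 is left free (assumption, e.g. for an empty word)
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def encode(sentence, vocab):
    """Replace every token of a sentence by its integer id."""
    return [vocab[token] for token in sentence.split()]


english = ["the house is small", "the house is big"]
french = ["la maison est petite", "la maison est grande"]

e_vocab = build_vocab(english)
f_vocab = build_vocab(french)

# Each sentence pair becomes a pair of integer lists.
pairs = [(encode(e, e_vocab), encode(f, f_vocab)) for e, f in zip(english, french)]
```

Working with integer ids rather than strings keeps count tables compact, which matters given the memory numbers above.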
John described the initial layout for a flexible decoder. He requested sample probability tables for input, and also offered a C++ library for log probability arithmetic that we can use in bilingual-text analysis. I also mentioned a couple of non-heuristic ISI decoders that may or may not be useful; one extracts the n most probable paths from a word lattice, and the other extracts the n most probable trees from a parse forest (scoring based on leaf/word trigrams, not structure).
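To make "log probability arithmetic" concrete, here is a rough sketch of the usual operations. This is not John's C++ library (I haven't described its interface here); the names log_add, log_mul, and log_sum are hypothetical, and it is written in Python only for brevity.

```python
import math

LOG_ZERO = float("-inf")  # log of probability 0


def log_add(a, b):
    """Return log(exp(a) + exp(b)) without leaving log space."""
    if a == LOG_ZERO:
        return b
    if b == LOG_ZERO:
        return a
    hi, lo = (a, b) if a >= b else (b, a)
    # Factor out the larger term so exp() never overflows.
    return hi + math.log1p(math.exp(lo - hi))


def log_mul(a, b):
    """Multiplying probabilities is adding their logs."""
    return a + b


def log_sum(values):
    """Sum a sequence of log probabilities in log space."""
    total = LOG_ZERO
    for v in values:
        total = log_add(total, v)
    return total


# 0.25 + 0.5 computed entirely in log space, then converted back.
p = math.exp(log_add(math.log(0.25), math.log(0.5)))
```

The point of staying in log space is that products of many small probabilities, as in long sentence pairs, do not underflow to zero.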
We reviewed the status of our desired resources. We have a lot of English-French data, which beyond 1.6M sentence pairs is not sentence aligned. Rights to the Collins English-French dictionary do not seem feasible. There might be an English-French dictionary at CMU. Jan described 150K Czech-English sentence pairs from the Reader's Digest, 20M words of Czech news, 390K words of treebank, and the Czech parser from last year's workshop. The status of Czech-English dictionaries is unclear. John said that a large English-Chinese text resource will be available soon from U. Penn as part of the Topic Detection and Tracking program.
* * *
In the afternoon we discussed possible workshop directions. This was more random and harder to summarize, but here's a cut at it. Building a software infrastructure for future experimentation and wide distribution is an important goal. So is demonstrating statistical MT for a distant language pair, such as Czech-English. If we concentrate on those two things, we should be able to have a success in six weeks, although one that may come at the expense of pursuing all kinds of novel ideas. How far we can pursue some of these novel ideas will be a function of (1) how much detail we can work out before the workshop, (2) whether or not existing tools can be brought to bear, and (3) how much effort is available.
Everyone's interests still cluster around three topics -- exploiting morphology, exploiting syntactic structure, and exploiting new resources. I don't think we made a lot of progress in working these ideas out to the next level. It's hard to do that in a room with more than four people :-) We agreed to circulate email proposals during the next month, and comment on them. At the next meeting, I think we should break up for a few hours into smaller planning groups.
* * *
We want to evaluate the baseline system at varying training-set sizes, to get an idea of how sensitive statistical MT is to data size, and whether or not it will work for languages where not much bilingual data is available.
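One simple way to set up the varying-size experiment is to train on nested subsets of a shuffled corpus, so each larger set contains the smaller ones. The sketch below only produces the subsets; the doubling schedule, the function name, and the default sizes are assumptions rather than anything we decided.

```python
import random


def nested_subsets(pairs, smallest=1000, seed=0):
    """Yield nested training sets of doubling size, ending with the full corpus."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed so runs are repeatable
    size = smallest
    while size < len(shuffled):
        yield shuffled[:size]  # every subset is a prefix of the next one
        size *= 2
    yield shuffled


# Toy corpus of 10 "sentence pairs"; gives subsets of size 2, 4, 8, 10.
toy_corpus = [("e%d" % i, "f%d" % i) for i in range(10)]
subsets = list(nested_subsets(toy_corpus, smallest=2))
sizes = [len(s) for s in subsets]
```

Training and scoring the baseline on each subset would then give one point per size on the quality vs. training-set size curve.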
To evaluate new ideas in an end-to-end setting, Dan proposed doing comparative evaluation, i.e., saying whether the new system did better than the old system on various sentences. This might be doable by monolingual speakers, using a reference English translation. Nobody mentioned anything about hiring a student for the summer.
David also proposed doing cross-entropy evaluation. Doing that on unseen test data will require smoothing, since any test event the model assigns zero probability makes the cross-entropy infinite.
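To show why smoothing is needed, here is a toy sketch using a unigram word model with add-lambda smoothing. This is only a stand-in: it is not the translation model, the smoothing scheme is not one we have chosen, and the single shared slot for unseen words is an assumption. Cross-entropy is computed as H = -(1/N) * sum of log2 p(w_i) over the N test words.

```python
import math
from collections import Counter


def train_unigram(tokens, lam=0.1):
    """Return an add-lambda smoothed unigram probability function."""
    counts = Counter(tokens)
    total = len(tokens)
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen words

    def prob(word):
        return (counts.get(word, 0) + lam) / (total + lam * vocab_size)

    return prob


def cross_entropy(prob, test_tokens):
    """Average negative log2 probability per test token, in bits."""
    return -sum(math.log2(prob(w)) for w in test_tokens) / len(test_tokens)


train = "the house is small the house is big".split()
test = "the house is green".split()  # "green" never occurs in training

prob = train_unigram(train, lam=0.1)
bits_per_word = cross_entropy(prob, test)
```

With lam=0 the unseen word "green" gets probability zero and the log is undefined, which is exactly the problem on unseen test data.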
Whether we measure quality in terms of translation accuracy or cross-entropy, new ideas should shift the quality vs. training-set size curve in an appealing direction -- otherwise they aren't good ideas. Notice two possible success criteria: better translation quality with the same amount of data, or the same quality with less data.
* * *
David suggested another goal -- late in the workshop, to build a basic "MT in an afternoon" system for a completely new language pair (Chinese, Japanese, Hindi, or something more exotic). The accuracy might be questionable, but having the resources to pull it off at all would be really cool. A test might be to see if people outside the group can distinguish between translations provided by our system and those provided by a commercial system.
* * *
Here's our to-do list for before the next meeting:
Yaser/Kevin: run model 3 training, inspect results. improve time and space requirements.
John: work on model 3 decoder.
Yaser/Kevin: send sample probability tables to John.
John: send log arithmetic package to Yaser.
Dan: talk to Montreal people (and others) about French tagged data and taggers.
Dan: sentence-align English-French data beyond 1.6M sentence pairs.
Lynn: see if we can use (in very practical terms) the Systran English-French translator during the workshop.
Jan: see if we can obtain or buy a Czech-English translation system, for use in the workshop.
Anyone: collect bilingual texts in languages of interest to you. probably should be at least 10K sentence pairs.
David: provide information about logging into pre-workshop machines to Yaser, Kevin, and John.
Kevin: contact Yeyi Wang about numerical data visualization software for alignments.
Anyone: circulate sketches of possible experiments in some detail, and comment.
Jan: determine whether any Czech-English dictionary may be available for the workshop.
John: determine whether any French-English dictionary may be available for the workshop.
Anyone: think about a feasible smoothing scheme for test-set cross-entropy.
Anyone: think about how we want to handle unknown words.
That's it!
Kevin