Andy's List
From CLSP Wiki
1. Phrasal alignment:
If GenPar is to be at all competitive (however this is measured) with 'state-of-the-art' Phrase-based SMT, then we need to move from a word- to a phrase-based model (and subsequent implementation). We need a phrase-based model also to deal 'properly' with:
- 'hard' cases (headswitching, relation changing etc.);
- DOT (and beyond)
Specifically, we might:
- at least in the first instance, see to what extent our algorithm from Coling-04 [Groves, Hearne & Way, 2004] is _directly_ implementable in GenPar;
- try to make PMTGs use phrasal alignments the best way we can, i.e. induce grammars from tree pairs incorporating phrasal alignments and retaining the most information possible;
- investigate how the word- and phrase-based alignments work together to improve alignment;
- investigate how the phrasal alignments would actually be used, e.g. in GenPar, would we (still) require that all words in one language be linked to some (possibly non-)word in the other?
Declan, Yihai, Mary & Dekai might also be interested in this sub-task.
2. w2w models:
Currently we have two possible w2w models, which should be compared using controlled experiments:
- Giza++
- (Dan's) 'built in' w2w module
Specifically:
- the latter is (reported to be) more configurable, so this should be tested empirically;
- 'improved' models of w2w alignment such as 'Giza++ and cognates' ought to be tested;
- unidirectional, bi-directional, symmetrical, refined, competitive etc. methods should also be tested to see how they can improve performance, especially during the alignment stage.
The point is that all envisaged MT models proceed on a BU basis, so w2w alignment needs to be as good as it can be ...
Declan is also likely to be interested in this (Dan too).
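To make the bi-directional and symmetrical options above concrete, here is a minimal sketch of the two simplest ways of combining a pair of directional w2w alignments: intersection (high precision) and union (high recall). The function name `symmetrize` and the representation of links as index pairs are assumptions made for illustration; this is not the interface of Giza++ or of the built-in GenPar module. The 'refined' method would sit somewhere between the two sets and is not implemented here.

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Combine two directional word alignments (hypothetical helper).

    src_to_tgt: iterable of (i, j) links from the source->target run,
                i = source word index, j = target word index.
    tgt_to_src: iterable of (j, i) links from the target->source run.

    Returns (intersection, union) as sets of (i, j) links.
    """
    forward = set(src_to_tgt)
    # Flip the reverse-direction links so both sets use (source, target) order.
    backward = {(i, j) for (j, i) in tgt_to_src}
    return forward & backward, forward | backward


# Toy example: the two directions disagree about source word 2.
forward_links = [(0, 0), (1, 1), (2, 1)]
backward_links = [(0, 0), (1, 1), (2, 2)]  # given as (target, source)
intersection, union = symmetrize(forward_links, backward_links)
```

Comparing alignment quality for the intersection, the union and anything in between is one way the controlled experiments above could be set up.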
3. DOT:
We now have a 'side by side' tree viewer, which would be very useful for DOT. If we can get a phrase-based aligner working (in the time available), then this could be compared to a 'straightforward' (ho, ho) implementation of DOT within the GenPar framework.
Specifically, we might:
- see to what extent our algorithm from [Hearne & Way, 2003; Hearne, 2005] is _directly_ implementable in GenPar. Given the number of training examples we're (already) using, pruning is likely to be a considerable issue here ...
- compare (on a theoretical level) the grammar currently being used in GenPar with DOT depth 1;
- on a practical level, make GenPar use a DOT depth 1 grammar in place of a PMTG or, more likely, convert the DOT grammar into a PMTG. We'd probably (at least for the time being) have to stick at depth 1 ...
Mary & Khalil are also likely to be interested in this.
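As a rough illustration of what 'depth 1' means here, the sketch below collects the depth-1 fragments of a single tree (one node plus its immediate children, i.e. its CFG productions) and scores them by relative frequency among fragments with the same root. It covers only one side of a tree pair: DOT proper works with linked source-target fragment pairs, which this sketch does not model. The tree encoding and the function names are assumptions for illustration, not GenPar's data structures.

```python
from collections import Counter


def depth1_fragments(tree):
    """Yield (root_label, child_labels) for every internal node.

    Assumed encoding: an internal node is (label, [children]);
    a leaf (word) is a plain string.
    """
    if isinstance(tree, str):
        return
    label, children = tree
    yield (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    for child in children:
        yield from depth1_fragments(child)


def relative_frequencies(trees):
    """Estimate P(fragment | root label) by relative frequency."""
    counts = Counter(f for t in trees for f in depth1_fragments(t))
    root_totals = Counter()
    for (root, _), n in counts.items():
        root_totals[root] += n
    return {f: n / root_totals[f[0]] for f, n in counts.items()}


# Toy treebank of two trees.
treebank = [
    ("S", [("NP", ["John"]), ("VP", [("V", ["sleeps"])])]),
    ("S", [("NP", ["Mary"]), ("VP", [("V", ["sleeps"])])]),
]
probs = relative_frequencies(treebank)
```

With only depth-1 fragments the model carries no more context than a plain PCFG read off the treebank, which is part of why the theoretical comparison with the current GenPar grammar is worth making.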
4. Richer Models:
We'll hardly get to this in the time available, but we're interested in models of MT which go beyond CFG-trees.
Keith, Mary & Markus are also likely to be interested in this, but I won't go into any more detail here given the limited likelihood of getting anything done in two and a half weeks ...
5. French:
Given [1-4] above, it goes without saying that we'll be using (at least) French<=>English to test out these ideas, but also possibly Arabic<=>English.
Declan & Andrea (the latter w.r.t. evaluation, perhaps?!) are also likely to be interested in this.
