Andy's List

From CLSP Wiki

1. Phrasal alignment:

If GenPar is to be at all competitive (however this is measured) with 'state-of-the-art' phrase-based SMT, then we need to move from a word-based to a phrase-based model (and a corresponding implementation). We also need a phrase-based model to deal 'properly' with:

- 'hard' cases (head-switching, relation changing, etc.);
- DOT (and beyond).

Specifically, we might:
- at least in the first instance, see to what extent our algorithm from Coling-04 [Groves, Hearne & Way, 2004] is _directly_ implementable in GenPar;
- try to make PMTGs use phrasal alignments as well as we can, i.e. induce grammars from tree pairs that incorporate phrasal alignments and retain as much information as possible;
- investigate how the word- and phrase-based alignments work together to improve alignment;
- investigate how the phrasal alignments would actually be used, e.g. in GenPar, would we (still) require that all words in one language be linked to some (possibly non-)word in the other?

Declan, Yihai, Mary & Dekai might also be interested in this sub-task.

2. w2w models:

Currently we have two possible w2w models, which should be compared using controlled experiments:

- Giza++
- (Dan's) 'built in' w2w module

Specifically:
- the latter is (reported to be) more configurable, so this should be tested empirically;
- 'improved' models of w2w alignment such as 'Giza++ and cognates' ought to be tested;
- unidirectional, bi-directional, symmetrical, refined, competitive etc. methods should also be tested to see how much they improve performance, especially during the alignment stage.
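To make the difference between the uni-directional and symmetrised variants concrete, here is a minimal sketch of how two directional alignments might be combined. It is not taken from Giza++ or from GenPar: the function names `symmetrize` and `grow`, and the representation of an alignment as a set of (source index, target index) pairs, are assumptions made for illustration. The `grow` step is a simplified neighbour-based heuristic in the spirit of the 'refined' methods, not a faithful reimplementation of any published one.

```python
def symmetrize(src2tgt, tgt2src):
    """Combine two directional word alignments.

    src2tgt: set of (i, j) links from the source->target run.
    tgt2src: set of (j, i) links from the target->source run.
    Returns (intersection, union), both as sets of (i, j) links.
    """
    reversed_links = {(i, j) for (j, i) in tgt2src}
    return src2tgt & reversed_links, src2tgt | reversed_links


def grow(intersection, union):
    """Start from the intersection and add union links that are
    adjacent to an accepted link and cover a still-unaligned word.
    (Hypothetical simplification, for illustration only.)"""
    aligned = set(intersection)
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - aligned):
            src_free = all(i != a for a, _ in aligned)
            tgt_free = all(j != b for _, b in aligned)
            adjacent = any((i + di, j + dj) in aligned
                           for di, dj in neighbours)
            if adjacent and (src_free or tgt_free):
                aligned.add((i, j))
                added = True
    return aligned
```

The intersection gives high-precision links, the union gives high-recall links, and the grown set sits in between, so a controlled experiment could plug each of the three into the same downstream pipeline and compare.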

The point is that all envisaged MT models proceed on a bottom-up (BU) basis, so w2w alignment needs to be as good as it can be ...

Declan is also likely to be interested in this (Dan too).

3. DOT:

We now have a 'side by side' tree viewer, which would be very useful for DOT. If we can get a phrase-based aligner working (in the time available), then this could be compared to a 'straightforward' (ho, ho) implementation of DOT within the GenPar framework.

Specifically, we might:
- see to what extent our algorithm from [Hearne & Way, 2003; Hearne, 2005] is _directly_ implementable in GenPar. Given the number of training examples we're (already) using, pruning is likely to be a considerable issue here ...
- compare (on a theoretical level) the grammar currently being used in GenPar with DOT depth 1;
- on a practical level, make GenPar use a DOT depth 1 grammar in place of a PMTG or, more likely, convert the DOT grammar into a PMTG. We'll probably (at least for the time being) have to stick at depth 1 ...
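As a rough illustration of what 'depth 1' means here, the sketch below extracts depth-1 fragments (a node together with its immediate children) from one side of a treebank and assigns each a relative-frequency score among fragments with the same root. It is monolingual only: in DOT proper the fragments are linked source/target pairs, which this sketch does not model. The nested-tuple tree encoding and the function names `depth1_fragments` and `relative_frequencies` are assumptions made for the example and have nothing to do with GenPar's actual data structures.

```python
from collections import Counter


def depth1_fragments(tree):
    """Yield (root_label, child_labels) for every internal node.

    A tree is a nested tuple (label, child, ...); a leaf is a plain
    string. This encoding is hypothetical, chosen for illustration.
    """
    if isinstance(tree, str):
        return
    label, *children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        yield from depth1_fragments(child)


def relative_frequencies(trees):
    """Score each depth-1 fragment by its count divided by the total
    count of fragments sharing the same root label."""
    counts = Counter(f for t in trees for f in depth1_fragments(t))
    root_totals = Counter()
    for (root, _), n in counts.items():
        root_totals[root] += n
    return {frag: n / root_totals[frag[0]] for frag, n in counts.items()}
```

At depth 1 each fragment has the same shape as a context-free production, which is what makes a comparison with (or conversion to) a rule-based grammar plausible at this depth; deeper fragments would not map across so directly.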

Mary & Khalil are also likely to be interested in this.

4. Richer Models:

We'll hardly get to this in the time available, but we're interested in models of MT which go beyond CFG-trees.

Keith, Mary & Markus are also likely to be interested in this, but I won't go into any more detail here given the limited likelihood of getting anything done in two and a half weeks ...

5. French:

Given [1-4] above, it goes without saying that we'll be using (at least) French<=>English to test out these ideas, but also possibly Arabic<=>English.

Declan & Andrea (the latter w.r.t. evaluation, perhaps?!) are also likely to be interested in this.
