Extending the search space of the Minimum Bayes-Risk Decoder for Machine Translation – Shankar Kumar (Google)
A Minimum Bayes-Risk (MBR) decoder seeks the hypothesis with the least expected loss for a given task. In the field of machine translation, the technique was originally developed for rescoring k-best lists of hypotheses generated by a statistical model. In this talk, I will present our work on extending the search space of the MBR decoder to very large lattices and hypergraphs that contain on average about 10^81 hypotheses! I will describe conditions on the loss function that enable efficient implementation of the decoder on such large search spaces. I will focus on the BLEU score (Papineni et al.) as the loss function for machine translation. To satisfy the conditions on the loss function, I will introduce a linear approximation to the BLEU score.
The MBR decoder under linearized BLEU can be easily implemented using Weighted Finite State Transducers. However, the resulting procedure is computationally expensive even for a moderately large lattice. The costly step is the computation of n-gram posterior probabilities. I will next present an approximate algorithm that is much faster than our Weighted Finite State Transducer approach. This algorithm extends to translation hypergraphs generated by systems based on synchronous context-free grammars. Inspired by work in speech recognition, I will finally present an exact yet efficient algorithm to compute n-gram posteriors on both lattices and hypergraphs.
The linear approximation to BLEU contains parameters that were initially derived from n-gram precisions seen on our development data. I will describe how we employed Minimum Error Rate Training (MERT) to estimate these parameters.
In the final part of the talk, I will describe an MBR-inspired scheme to learn a consensus model over the n-gram features of multiple underlying component models. This scheme works on a collection of hypergraphs or lattices produced by syntax-based or phrase-based translation systems. MERT is used to train the parameters. The approach outperforms a pipeline of MBR decoding followed by standard system combination while using less total computation.
This is joint work with Wolfgang Macherey, Roy Tromble, Chris Dyer, John DeNero, Franz Och and Ciprian Chelba.
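To make the idea of MBR decoding under a linearized BLEU gain concrete, here is a minimal sketch in Python. It is not the lattice or hypergraph implementation described in the talk: it works on a small k-best list, estimates n-gram posteriors by normalizing the hypothesis scores, and scores each candidate with a linear function of its length and its n-gram counts weighted by those posteriors. The function names, the toy k-best list, and the values of theta0 and theta are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=4):
    """Count every n-gram of order 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def ngram_posteriors(kbest, max_n=4):
    """Posterior probability that an n-gram occurs in a translation.

    kbest is a list of (tokens, log_score) pairs. Hypothesis posteriors
    are a softmax over the log scores (an assumption of this sketch).
    """
    top = max(score for _, score in kbest)
    weights = [math.exp(score - top) for _, score in kbest]
    total = sum(weights)
    post = Counter()
    for (tokens, _), weight in zip(kbest, weights):
        # Iterating over the Counter visits each distinct n-gram once,
        # so this accumulates presence, not the number of occurrences.
        for gram in ngrams(tokens, max_n):
            post[gram] += weight / total
    return post


def mbr_decode(kbest, theta0, theta, max_n=4):
    """Return the hypothesis with the highest expected linear gain."""
    post = ngram_posteriors(kbest, max_n)

    def gain(tokens):
        value = theta0 * len(tokens)
        for gram, count in ngrams(tokens, max_n).items():
            value += theta[len(gram)] * count * post[gram]
        return value

    return max((tokens for tokens, _ in kbest), key=gain)


# Toy k-best list (hypothetical): the top-scoring hypothesis is an outlier.
kbest = [
    ("cat mat".split(), -0.9),
    ("the cat sat on the mat".split(), -1.0),
    ("a cat sat on the mat".split(), -1.2),
    ("the cat sat on a mat".split(), -1.3),
]
theta0 = -0.1                              # illustrative length weight
theta = {1: 0.25, 2: 0.5, 3: 1.0, 4: 2.0}  # illustrative n-gram weights

map_best = max(kbest, key=lambda pair: pair[1])[0]
mbr_best = mbr_decode(kbest, theta0, theta)
print(" ".join(map_best))
print(" ".join(mbr_best))
```

In this toy example the highest-scoring hypothesis shares few n-grams with the rest of the list, so the MBR decision moves to a hypothesis whose n-grams carry more posterior mass. On lattices and hypergraphs the list cannot be enumerated, which is why computing the n-gram posteriors efficiently becomes the central problem.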
Shankar Kumar is a researcher in the speech group at Google. Prior to this, he worked on Google's language translation effort. His current interests are in statistical methods for language processing, with a particular emphasis on speech recognition and translation.