Boosting systems for LVCSR – George Saon (IBM)

Abstract
Current ASR systems can reach high levels of performance on particular domains, as attested by various government-sponsored speech recognition evaluations. This comes at the expense of ever-increasing complexity in the recognition architecture. Typically, LVCSR systems employ multiple decoding and rescoring passes with several speaker adaptation passes in between. Unfortunately, a lot of human intervention is required in choosing which systems are good for combination, and this knowledge is often task-dependent and cannot be easily transferred to other domains. Ideally, one would want an automatic procedure for training accurate systems/models which make complementary recognition errors. Boosting is a popular machine learning technique for incrementally building linear combinations of “weak” models to generate an arbitrarily “strong” predictive model. We employ a variant of the popular AdaBoost algorithm to train multiple acoustic models such that the aggregate system exhibits improved performance over the individual recognizers. Each model is trained sequentially on re-weighted versions of the training data. At each iteration, the weights are decreased for the frames that are correctly decoded by the current system. These weights are then multiplied with the frame-level statistics for the decision trees and Gaussian mixture components of the next iteration's system. The composite system uses a log-linear combination of HMM state observation likelihoods. We report experimental results on several broadcast news transcription setups which differ in the language being spoken (English and Arabic) and in the amount of training data. Additionally, we study the impact of boosting on ML and discriminatively trained acoustic models. Our findings suggest that significant gains can be obtained for small amounts of training data, even after feature-space and model-space discriminative training.
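The re-weighting and combination steps described in the abstract can be illustrated with a minimal Python sketch. It is not the speaker's implementation: the function names, the single down-weighting factor `beta`, and the combination weights `alphas` are assumptions made for illustration. In classic AdaBoost the factor is derived from the weighted error of the current model; the abstract does not give the formula used by this variant, so `beta` is left as a free parameter here.

```python
def reweight_frames(weights, correct, beta):
    """Decrease the weights of correctly decoded frames, then renormalize.

    weights: current per-frame weights
    correct: booleans, True where the current system decoded the frame correctly
    beta:    factor in (0, 1) applied to correct frames (assumed, not from the talk)
    """
    scaled = [w * beta if c else w for w, c in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled]


def weighted_mean(frames, weights):
    """Weighted frame-level statistic: each frame is multiplied by its weight."""
    return sum(w * x for w, x in zip(weights, frames)) / sum(weights)


def combine_log_likelihoods(log_likes, alphas):
    """Log-linear combination of per-model state log-likelihoods:
    sum over models k of alphas[k] * log p_k(x | s)."""
    return sum(a * ll for a, ll in zip(alphas, log_likes))


# Four frames with uniform weights; the third frame was misrecognized.
weights = reweight_frames([0.25] * 4, [True, True, False, True], beta=0.5)

# Toy one-dimensional features: the misrecognized frame now counts more.
mean = weighted_mean([1.0, 2.0, 3.0, 4.0], weights)

# Two models scoring the same frame for the same HMM state.
score = combine_log_likelihoods([-2.0, -4.0], [0.5, 0.5])
```

After re-weighting, the misrecognized frame carries twice the weight of each correctly decoded frame, so it contributes more to the statistics accumulated for the next model in the sequence.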
Biography
George Saon received his M.Sc. and Ph.D. degrees in Computer Science from the Henri Poincare University in Nancy, France, in 1994 and 1997, respectively. From 1994 to 1998, he worked on two-dimensional stochastic models for off-line handwriting recognition at the Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA). Since 1998, Dr. Saon has been with the IBM T.J. Watson Research Center, where he has tackled a variety of problems spanning several areas of large vocabulary continuous speech recognition, such as discriminative feature processing, acoustic modeling, speaker adaptation and large vocabulary decoding algorithms. Some of the techniques that he co-invented are well known to the speech community, including heteroscedastic discriminant analysis (HDA), implicit lattice discriminative training, lattice-MLLR, feature-space Gaussianization and fast FSM-based Viterbi decoding. Since 2001, Dr. Saon has been a key member of IBM’s speech recognition team, which participated in several U.S. government-sponsored evaluations for the EARS, SPINE and GALE programs. In the context of GALE, he is also foraying into statistical machine translation. He has published over 70 conference and journal papers and holds several patents in the field of ASR.

Center for Language and Speech Processing