Language Modeling Issues for Spanish Language Large Vocabulary Continuous Speech Recognition

The goal of this workshop project is to explore language modeling techniques that improve recognition of unrestricted, conversational Spanish over telephone channels. The basic training and test data will come from the Spanish-language component of the Linguistic Data Consortium’s Call Home corpus, a corpus of transcribed telephone conversations. Text corpora, as well as other sources of transcribed speech, will also be available.

We will start with a baseline Spanish speech recognizer built with BBN’s Byblos speech recognition system. The workshop will be provided with N-best and/or lattice outputs from this recognizer, and we will endeavor to develop and evaluate language models that improve on the baseline performance level. In particular, we want to exploit specific aspects of the Spanish language to improve the performance of the recognizer. Rescoring the N-best lists and lattices will provide one means of evaluating our ideas, and perplexity measurements will provide another. Our progress will also be measured by our improved understanding of how language characteristics should influence the choice of a language model for recognition.
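The two evaluation routes mentioned above, perplexity and N-best rescoring, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the add-one-smoothed bigram model is only a stand-in for whatever language models the project develops, the function names (train_bigram, log_prob, perplexity, rescore_nbest) are hypothetical, the training sentences and acoustic scores are invented, and the N-best list is a plain Python list rather than the actual Byblos output format.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (illustrative choice)


def train_bigram(sentences):
    """Collect context counts, bigram counts, and the vocabulary from tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    vocab = {EOS}
    for words in sentences:
        tokens = [BOS] + words + [EOS]
        vocab.update(words)
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, vocab


def log_prob(words, model):
    """Natural-log probability of a sentence under an add-one-smoothed bigram model."""
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # one extra slot shared by all unseen words
    tokens = [BOS] + words + [EOS]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + v))
        for prev, cur in zip(tokens, tokens[1:])
    )


def perplexity(sentences, model):
    """Per-token perplexity, counting one end-of-sentence token per sentence."""
    total = sum(log_prob(s, model) for s in sentences)
    n_tokens = sum(len(s) + 1 for s in sentences)
    return math.exp(-total / n_tokens)


def rescore_nbest(nbest, model, lm_weight=1.0):
    """Return the (words, acoustic_log_score) hypothesis with the best combined score."""
    return max(nbest, key=lambda h: h[1] + lm_weight * log_prob(h[0], model))


# Toy data, invented for this sketch.
train = [
    ["hola", "cómo", "estás"],
    ["hola", "qué", "tal"],
    ["cómo", "estás", "hoy"],
]
model = train_bigram(train)

# Two competing hypotheses with made-up acoustic log scores.
nbest = [
    (["ola", "cómo", "estás"], -9.8),
    (["hola", "cómo", "estás"], -10.0),
]
best_words, _ = rescore_nbest(nbest, model)
print(" ".join(best_words))
print(perplexity([["hola", "cómo", "estás"]], model))
```

In this toy setting the language model score can overturn a small acoustic preference for an unlikely word sequence, which is the effect N-best rescoring is meant to measure. The lm_weight parameter is a hypothetical tuning knob that would normally be set on held-out data.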

Team Members
Senior Members
German Bordel
Pierre Dupont
Herb Gish
Jose Oncina
Carol Van Ess-Dykema
Graduate Students
Lin Chase
Eric Wheeler

Johns Hopkins University, Whiting School of Engineering

Center for Language and Speech Processing
Hackerman 226
3400 North Charles Street, Baltimore, MD 21218-2680