Toward Language-Independent Acoustic Modeling

The state of the art in automatic speech recognition (ASR) has advanced considerably for those languages for which large amounts of data are available to build the ASR system. Obtaining such data is usually very difficult, as it includes tens of hours of recorded speech along with accurate transcriptions, an on-line dictionary or lexicon which lists how words are pronounced in terms of elementary sound units such as phonemes, and on-line text resources. The text resources are used to train a language model, which helps the recognizer anticipate likely words; the dictionary tells the recognizer how a word will sound in terms of phonemes when it is spoken; and the speech recordings are used to learn the acoustic signal pattern for each phoneme. The result is a hierarchy of models which work together to recognize successive spoken words. Relatively little research has been done on building speech recognition systems for languages for which such data resources are not available, a situation which unfortunately holds for all but a few languages of the world.

This project will investigate the use of speech from diverse source languages to build an ASR system for a single target language. We will study promising modeling techniques to develop ASR systems in languages for which large amounts of training data are not available. We intend to pursue three themes. The first concerns the development of algorithms to map pronunciation dictionary entries in the target language to elements in the dictionaries of the source languages. The second theme will be Discriminative Model Combination (DMC) of acoustic models in the individual source languages for recognition of speech in the target language. The third theme will be development of clustering and adaptation techniques to train a single set of acoustic models using data pooled from the available source languages. The goal is to develop Czech Broadcast News (BN) transcription systems using a small amount of Czech adaptation data to augment training data available in English, Spanish, and Mandarin. The best data for this modeling task would be natural, unscripted speech collected on a quiet, wide-band acoustic channel. News broadcasts are a good source of such speech and are fairly easily obtained. Broadcast news data of other source or target languages, possibly German or Russian, will be used if they become available in a suitable amount and quality.
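As a rough illustration of the second theme, the sketch below combines hypothesis scores from several source-language acoustic models log-linearly, which is the general form of model combination that DMC builds on. It is a minimal sketch under stated assumptions, not the project's implementation: the function names, the candidate transcriptions, the log scores, and the weights are all hypothetical. In DMC proper the weights would be trained discriminatively on target-language data; here they are simply fixed by hand.

```python
import math

def combine_scores(model_log_scores, weights):
    """Log-linear combination: for each hypothesis, sum_i weight_i * log_score_i."""
    hypotheses = model_log_scores[0].keys()
    return {
        h: sum(w * scores[h] for w, scores in zip(weights, model_log_scores))
        for h in hypotheses
    }

def posteriors(combined):
    """Normalize combined log scores into a distribution over hypotheses."""
    peak = max(combined.values())
    # Log-sum-exp, shifted by the peak for numerical stability.
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in combined.values()))
    return {h: math.exp(v - log_norm) for h, v in combined.items()}

def best_hypothesis(model_log_scores, weights):
    """Return the hypothesis with the highest combined score."""
    combined = combine_scores(model_log_scores, weights)
    return max(combined, key=combined.get)

# Hypothetical log scores that three source-language acoustic models assign
# to two candidate Czech transcriptions (illustrative numbers, not real data).
english = {"dobry den": -10.0, "dobry ten": -9.0}
spanish = {"dobry den": -8.0, "dobry ten": -11.0}
mandarin = {"dobry den": -12.0, "dobry ten": -12.5}

# Hypothetical hand-set weights; DMC would instead estimate them
# discriminatively from target-language adaptation data.
weights = [0.5, 0.3, 0.2]

models = [english, spanish, mandarin]
print(best_hypothesis(models, weights))
print(posteriors(combine_scores(models, weights)))
```

The appeal of this form is that a model which is unreliable for the target language can be given a small weight rather than being discarded outright.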

Final Report

 

Team Members
Senior Members
Sanjeev Khudanpur (CLSP)
Peter Beyerlein (PRL)
William Byrne (CLSP/JHU)
John Morgan (West Point)
Joe Picone (Miss. State)
Graduate Students
Juan Huerta (CMU)
Nino Peterek (Charles Univ., CR)
Undergraduate Students
Bhaskara Marthi (UToronto)
Wei Wang (Rice)

Center for Language and Speech Processing