|
|
The state of the art in automatic speech recognition
(ASR)
has advanced considerably for those languages for which large amounts of
data are available to build the ASR system. Obtaining such data is
usually very difficult as it includes tens of hours of recorded speech
along with accurate transcriptions, an on-line dictionary or lexicon which
lists how words are pronounced in terms of elementary sound units such
as phonemes, and on-line text resources. The text resources are used
to train a language model which helps the recognizer anticipate likely
words, the dictionary tells the recognizer how a word will sound
in terms of phonemes when it is spoken, and the speech recordings are used
to learn the acoustic signal pattern for each phoneme, resulting in a
hierarchy
of models which work together to recognize successive spoken words.
Relatively little research has been done on building speech recognition
systems for languages for which such data resources are not available ---
a situation which unfortunately is true for all but a few languages of
the world.
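
As a rough illustration of how the three resources described above fit
together, the toy sketch below scores one candidate word sequence: a lexicon
expands the words into phonemes, a bigram language model scores the word
order, and a per-phoneme acoustic score stands in for the acoustic models.
All entries, probabilities, and function names are invented for illustration,
and a real recognizer would score audio frames with statistical acoustic
models rather than use one number per phoneme.

```python
import math

# Toy lexicon: word -> phoneme sequence (entries are illustrative assumptions).
LEXICON = {
    "good": ["g", "uh", "d"],
    "news": ["n", "uw", "z"],
}

# Toy bigram language model: P(word | previous word); "<s>" marks sentence start.
BIGRAM = {
    ("<s>", "good"): 0.2,
    ("good", "news"): 0.5,
}


def expand_to_phonemes(words):
    """Look up each word in the lexicon and concatenate the phoneme sequences."""
    phonemes = []
    for word in words:
        phonemes.extend(LEXICON[word])
    return phonemes


def lm_log_prob(words):
    """Sum the bigram log probabilities over the word sequence."""
    total = 0.0
    prev = "<s>"
    for word in words:
        total += math.log(BIGRAM[(prev, word)])
        prev = word
    return total


def hypothesis_score(words, acoustic_log_likelihood):
    """Add language-model and acoustic log scores for one hypothesis.

    acoustic_log_likelihood maps each phoneme to an assumed log-likelihood;
    it is a stand-in for scores that acoustic models would compute from audio.
    """
    phonemes = expand_to_phonemes(words)
    acoustic = sum(acoustic_log_likelihood[p] for p in phonemes)
    return lm_log_prob(words) + acoustic
```

A recognizer would compare such scores across many competing word sequences
and output the highest-scoring one.
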
This project will investigate the use of speech from
diverse
source languages to build an ASR system for a single target
language.
We will study promising modeling techniques to develop ASR systems in
languages
for which large amounts of training data are not available. We
intend
to pursue three themes. The first concerns the development of
algorithms
to map pronunciation dictionary entries in the target language to elements
in the dictionaries of the source languages. The second theme will
be Discriminative Model Combination (DMC) of acoustic models in the
individual
source languages for recognition of speech in the target language.
The third theme will be development of clustering and adaptation
techniques
to train a single set of acoustic models using data pooled from the
available
source languages. The goal is to develop Czech Broadcast News (BN)
transcription systems using a small amount of Czech adaptation data to
augment training data available in English, Spanish, and Mandarin.
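
The first two themes can be sketched in a few lines, under strong simplifying
assumptions. The phone table below is a hypothetical hand-written mapping from
Czech phones to English phones, whereas the project aims to develop algorithms
for such mappings. The score combination is a plain log-linear (weighted) sum
of per-language acoustic log scores, which is the general form DMC is usually
described as taking; here the weights are fixed by hand rather than trained
discriminatively, and every symbol, score, and function name is an assumption
made for illustration.

```python
# Hypothetical mapping from target-language (Czech) phones to the nearest
# phone in one source language (English). Symbols and pairings are
# illustrative assumptions, not taken from the project.
CZECH_TO_ENGLISH = {
    "p": "p",
    "r": "r",
    "a": "aa",
    "h": "hh",
    "ch": "hh",  # assumed nearest substitute for a phone English lacks
}

# Hand-set combination weights, one per source-language acoustic model.
# In DMC these would be trained to reduce recognition errors.
WEIGHTS = {"english": 0.5, "spanish": 0.3, "mandarin": 0.2}

# Made-up acoustic log scores for two competing Czech word hypotheses.
HYPOTHESES = {
    "praha": {"english": -10.0, "spanish": -11.0, "mandarin": -14.0},
    "brada": {"english": -12.0, "spanish": -11.5, "mandarin": -13.0},
}


def map_pronunciation(target_phones, phone_map):
    """Rewrite a target-language pronunciation using source-language phones."""
    return [phone_map[p] for p in target_phones]


def combine_log_linear(log_scores, weights):
    """Weighted sum of per-model log scores for a single hypothesis."""
    return sum(weights[name] * log_scores[name] for name in weights)


def pick_best(hypotheses, weights):
    """Return the hypothesis with the highest combined score."""
    return max(
        hypotheses,
        key=lambda h: combine_log_linear(hypotheses[h], weights),
    )
```

For example, `map_pronunciation(["p", "r", "a", "h", "a"], CZECH_TO_ENGLISH)`
rewrites a Czech dictionary entry so that English acoustic models can score
it, and `pick_best(HYPOTHESES, WEIGHTS)` selects the word whose weighted
scores across the three source-language models are highest.
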
The best data for this modeling task would be natural, unscripted speech
collected on a quiet, wide-band acoustic channel. News broadcasts
are a good source of such speech and are fairly easily obtained.
Broadcast news data in other source or target languages, possibly German
or Russian, will be used if it becomes available in suitable quantity
and quality.
|