Humans have little difficulty recognizing speech in noisy
environments, speech distorted by having passed through an unknown channel,
or speech from nonnative speakers. We adapt to the characteristics of the
new speech, often after hearing only a few seconds of it. Adaptation
techniques have been developed for automatic speech recognizers that
attempt to compensate similarly for differences between the speech on
which the system was trained and the speech it has to recognize.
However, several minutes of speech from the new speaker or environment
have to be provided to the system to obtain any significant improvement in
recognition performance. An automatic speech recognition system employs a
number of models for small segments of speech sounds such as phonemes.
Simply put, transforming each of these models requires that a sufficient
number of samples of each segment be seen from the new speaker. When a
small amount of new speech is heard, humans are able to exploit
relationships between various sounds so that having heard a few of them in
the distorted environment is adequate to adjust for the unheard ones as
well. In automatic systems, therefore, if sufficient speech is not
available to adapt all the models individually, some method must be
devised to transform the models of the unheard or insufficiently heard
segments based on the heard ones. The participants in this project plan to
move beyond the commonly used remedy of tying, that is, forcing to be
identical, the transformations of the models of related speech units. They instead
plan to study the dependencies between the speech units, so that the model
transformation for one unit influences but is not necessarily identical to
the transformation for another unit. They plan to use this knowledge to
transform each model individually without requiring a large sample of each
speech segment for adaptation. Modelling techniques they plan to employ
include covariance models such as Markov random fields and dependency
trees.
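
The difference between tying and dependency-based transformation can be illustrated with a minimal sketch. It is not the project's method: it assumes, purely for illustration, that each speech unit's adaptation is a single scalar shift, that the shifts are jointly Gaussian with zero prior mean, and that their covariance is known. The function names and the covariance values are hypothetical. Shifts for unheard units are then predicted as the conditional mean given the shifts observed for heard units, so each unheard unit is influenced by, but not set equal to, the heard ones.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def predict_unheard_shifts(cov, observed):
    """Predict a shift for every unit from the shifts of the heard units.

    cov      -- assumed covariance matrix of per-unit shifts (list of lists)
    observed -- dict mapping heard unit index -> shift estimated from new speech
    Assumes a zero-mean joint Gaussian prior over shifts (illustrative only).
    """
    heard = sorted(observed)
    unheard = [i for i in range(len(cov)) if i not in observed]
    cov_hh = [[cov[i][j] for j in heard] for i in heard]
    # weights = inverse(cov_hh) applied to the observed shifts
    weights = solve(cov_hh, [observed[i] for i in heard])
    shifts = dict(observed)
    for u in unheard:
        # conditional mean: cov_uh * inverse(cov_hh) * observed shifts
        shifts[u] = sum(cov[u][h] * w for h, w in zip(heard, weights))
    return shifts


# Hypothetical covariance: unit 1 is strongly related to unit 0, unit 2 only weakly.
example_cov = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
# Only unit 0 was heard in the new environment, with an estimated shift of 2.0.
example_shifts = predict_unheard_shifts(example_cov, {0: 2.0})
```

Under tying, all three units would receive the shift 2.0. In this sketch the strongly related unit 1 receives a large fraction of that shift, while the weakly related unit 2 is moved only slightly.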