Humans can transcribe conversational speech with nearly perfect accuracy; the best automatic speech recognition (ASR) systems currently
have word error rates over 20% on conversational speech transcription. What can we do to improve automatic speech recognition?
This project seeks to bring together new ideas from linguistics with new ideas from graphical models and support vector machines in order
to match human speech recognition performance by better modeling what happens in the brain when humans listen to speech. Specifically,
this project will focus on two aspects of human speech communication that are not well modeled by current ASR:
- Asynchrony between manner of articulation (the type of a sound, e.g., stop, fricative, vowel) and place of articulation
(the shape of the lips and tongue when a sound is produced, e.g., lip rounding, position of the tongue tip). Asynchrony will be modeled
by creating a dictionary of dynamic Bayesian networks, one for each word in the English language, each designed to learn word-dependent
synchronization between manner and place.
- Extra attention to consonant releases and closures. During the lip closure of a [p], there is no sound: in order
to determine that the stop is a [p], a human listener must pay special attention to the 50 ms immediately before stop closure and
immediately after release. Current ASR systems pay attention uniformly to the signal at all times. This workshop will develop
discriminative classifiers (support vector machines) that detect and classify perceptually important events in the signal such as
consonant releases and closures.
Our goal is to provide the recognizer with new information that was previously unavailable to it, thus improving the accuracy of the
automatic speech recognition system.
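
For reference, the word error rate quoted above is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal Python sketch under that assumption; the function name `word_error_rate` and the whitespace tokenization are illustrative choices, not part of this project's tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, which real scoring tools refine.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("a b c d e", "a x c d e"))  # 0.2
```

Under this definition, a word error rate over 20% means that more than one in five reference words is substituted, deleted, or offset by an inserted word.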