What's up with pronunciation variation? Why it's so hard to model and what to do about it – Dan Jurafsky (University of Colorado, Boulder Department of Linguistics, Department of Computer Science, Institute of Cognitive Science, & Center for Spoken Language Research)
Automatic recognition of human-to-machine speech has made fantastic progress over the last few decades, and current systems achieve word error rates below 5% on many tasks. But recognition of human-to-human speech is much harder; error rates are often 30% or even higher. Many studies of human-to-human speech have shown that pronunciation variation is a key factor contributing to these high error rates. Previous models of pronunciation variation, however, have not had significant success in reducing error rates. To help understand why gains in pronunciation modeling have proven so elusive, we investigated which kinds of pronunciation variation are well captured by current triphone models, and which are not. By examining the change in behavior of a recognizer as it receives further triphone training, we show that many of the kinds of variation that previous pronunciation models attempt to capture, such as phone substitution or phone reduction due to neighboring phonetic contexts, are already well captured by triphones. Our analysis suggests rather that syllable deletion caused by non-phonetic factors is a major cause of difficulty for recognizers. We then investigated a number of such non-phonetic factors in a large database of phonetically hand-transcribed words from the Switchboard corpus. Using linear and logistic regression to control for phonetic context and rate of speech, we did indeed find highly significant effects of non-phonetic factors. For example, words have extraordinarily long and full pronunciations when they occur near “disfluencies” (pauses, filled pauses, and repetitions), or initially or finally in turns or utterances, while words that have a high unigram, bigram, or reverse bigram (given the following word) probability have much more reduced pronunciations. These factors must be modeled with lexicons based on dynamic pronunciation probabilities; I describe our work-in-progress on building such a lexicon.
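To make the regression idea concrete, here is a minimal sketch of the kind of logistic model the abstract describes: the probability that a word surfaces in a reduced pronunciation as a function of non-phonetic predictors (contextual predictability, proximity to a disfluency, turn/utterance-edge position) while controlling for rate of speech. The feature names and all coefficient values below are purely illustrative assumptions, not the study's fitted values.

```python
import math

def p_reduced(log_bigram_prob, near_disfluency, turn_edge, rate_of_speech,
              # Illustrative coefficients only -- NOT the fitted values from the study.
              b0=-1.0, b_prob=0.8, b_disfl=-1.5, b_edge=-1.2, b_rate=0.6):
    """Toy logistic model of P(word is pronounced in reduced form).

    log_bigram_prob : log P(word | previous word); closer to 0 = more predictable
    near_disfluency : 1 if the word is adjacent to a pause/filled pause/repetition
    turn_edge       : 1 if the word is turn- or utterance-initial/final
    rate_of_speech  : local speaking rate (e.g. syllables/sec), as a control
    """
    z = (b0
         + b_prob * log_bigram_prob   # more predictable -> more reduction
         + b_disfl * near_disfluency  # near a disfluency -> fuller pronunciation
         + b_edge * turn_edge         # turn/utterance edge -> fuller pronunciation
         + b_rate * rate_of_speech)   # faster speech -> more reduction
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) link

# A highly predictable word mid-utterance is far more likely to reduce
# than an unpredictable word sitting next to a disfluency:
likely_reduced = p_reduced(log_bigram_prob=-0.5, near_disfluency=0,
                           turn_edge=0, rate_of_speech=2.0)
likely_full = p_reduced(log_bigram_prob=-8.0, near_disfluency=1,
                        turn_edge=1, rate_of_speech=2.0)
```

A dynamic-pronunciation lexicon of the sort described at the end of the abstract would use such per-word probabilities, recomputed from the current context, to reweight the pronunciation variants offered to the recognizer.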
This talk describes joint work with Wayne Ward, Alan Bell, Eric Fosler-Lussier, Dan Gildea, Cynthia Girand, Michelle Gregory, Keith Herold, Zhang Jianping, William D. Raymond, Zhang Sen, and Yu Xiuyang.
Dan Jurafsky is an assistant professor in the Linguistics and Computer Science departments, the Institute of Cognitive Science, and the Center for Spoken Language Research at the University of Colorado, Boulder. He was last at Hopkins for the JHU Summer 1997 Workshop managing the dialog act modeling group. Dan is the author with Jim Martin of the recent Prentice Hall textbook “Speech and Language Processing”, and is teaching speech synthesis and recognition this semester at Boulder. Dan also plays the drums in mediocre pop bands and the corpse in local opera productions, and is currently working on his recipe for “Three Cups Chicken”.