Pronunciation Modeling of Mandarin Casual Speech

When people speak casually in daily life, their pronunciation is not consistent, and listeners commonly encounter many different pronunciations of the same word. Current automatic speech recognition systems can reach word accuracies above 90% when evaluated on carefully produced standard speech, but on casual, unplanned speech, performance drops to 75% or even lower. There are many reasons for this. In casual speech, one phoneme can shift to another. In Mandarin, for example, the initial /sh/ in “wo shi (I am)” is often pronounced weakly and shifts into an /r/. In other cases, sounds are dropped. In Mandarin, phonemes such as /b/, /p/, /d/, /t/, and /k/ are often reduced and, as a result, are often recognized as silence. These problems are especially severe in Mandarin casual speech because most Chinese people are not native speakers of Mandarin. Chinese languages such as Cantonese are as different from standard Mandarin as French is from English. As a result, the influence of speakers’ native languages introduces even larger pronunciation variation.

We propose to study and model such pronunciation differences in casual speech using interviews found in Mandarin news broadcasts. We hope to include experienced researchers from both China and the US in the areas of pronunciation modeling, Mandarin speech recognition, and Chinese phonology.

Team Members
Senior Members
William Byrne, CLSP/JHU
Pascale Fung, HKUST
Terri Kamm, Department of Defense
Tom Zheng, Tsinghua University
Graduate Students
Zhanjiang Song, Tsinghua University
Veera Venkatramani, CLSP/JHU
Undergraduate Students
Umar Ruhi, University of Toronto

Center for Language and Speech Processing