We seek to bring together new ideas from linguistics (especially nonlinear phonology) with new ideas from artificial intelligence (especially graphical models and support vector machines) in order to better match human speech recognition performance. Specifically, we will focus on two aspects of human speech communication that are not well modeled by current ASR:
(1) Asynchrony between manner of articulation (the type of a sound, e.g., stop, fricative, vowel) and place of articulation (the shape of the lips and tongue when a sound is produced, e.g., lip rounding, position of the tongue tip). Asynchrony will be modeled by creating a dictionary of dynamic Bayesian networks, one for each word in the English language, each designed to learn word-dependent synchronization between manner and place. Specifically, the pronunciation model will consist of a graphical model of each word, with separate hidden state streams representing the distinctive features of manner, place, and voicing. Arcs in the graphical model will explicitly represent learnable approximate synchronization relationships between the different distinctive feature tiers.
(2) Extra attention to acoustic phonetic landmarks: consonant releases, consonant closures, and syllable nuclei. During the lip closure of a [p], there is no sound: in order to determine that the stop is a [p], a human listener must pay special attention to the 50 ms immediately before stop closure and immediately after release. Current ASR systems pay attention uniformly to the signal at all times. This workshop will develop discriminative classifiers (support vector machines) that detect and classify perceptually important acoustic phonetic landmarks. Methods will be developed to integrate the discriminant computation of the support vector machine with the generative probabilistic framework of the graphical models. Specifically, one of the methods we propose to test is a set of low-dimensional Gaussian likelihood functions, synchronized with the landmarks specified by a graphical model, and observing the discriminant output scores of the support vector machine.
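The word-internal asynchrony idea in (1) can be pictured with a toy dynamic program over two state streams: each stream must traverse its own sequence of feature states, and a per-frame penalty stands in for the learnable synchronization arcs between distinctive-feature tiers. Everything here (state counts, emission scores, the penalty form) is an illustrative sketch under our own assumptions, not the workshop's actual model.

```python
import itertools

def joint_score(T, n_manner, n_place, emit, async_penalty=1.0):
    """Viterbi over the joint (manner, place) state lattice.

    Each stream may hold or advance one state per frame; the penalty
    term softly couples the two streams, a stand-in for the learnable
    synchronization arcs between distinctive-feature tiers.
    """
    NEG = float("-inf")
    V = {(0, 0): emit(0, 0, 0)}  # both streams start in their first state
    for t in range(1, T):
        newV = {}
        for m, p in itertools.product(range(n_manner), range(n_place)):
            prev_best = max(V.get((m - dm, p - dp), NEG)
                            for dm in (0, 1) for dp in (0, 1))
            if prev_best > NEG:
                newV[(m, p)] = (prev_best + emit(t, m, p)
                                - async_penalty * abs(m - p))
        V = newV
    # a path is valid only if both streams finish their last state
    return V.get((n_manner - 1, n_place - 1), NEG)
```

With uniform emission scores, the fully synchronous path wins; when the two streams have different lengths, the best path pays exactly the minimal unavoidable asynchrony cost.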
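One way to picture the integration proposed in (2): treat the one-dimensional SVM discriminant output at a hypothesized landmark as an observation, and model it with class-conditional Gaussians, yielding a log likelihood ratio the generative graphical model can consume. The means and variances below are made-up illustrations, standing in for Gaussians that would be fit to held-out SVM outputs.

```python
import math

def gauss_logpdf(x, mean, var):
    # log of a univariate Gaussian density
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def landmark_llr(d, present=(1.0, 0.5), absent=(-1.0, 0.5)):
    """log p(d | landmark present) - log p(d | landmark absent).

    d is the SVM discriminant output at a hypothesized landmark; the
    (mean, variance) pairs are illustrative, not trained values.
    """
    return gauss_logpdf(d, *present) - gauss_logpdf(d, *absent)
```

A strongly positive discriminant output then contributes evidence for the landmark, a strongly negative one against it, and the ratio plugs into the graphical model's likelihood computation like any other observation score.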
Evaluation experiments will use the proposed system to re-score lattices generated by a state-of-the-art speech recognizer on the Switchboard test corpus. The baseline for comparison will be the maximum-likelihood path through the lattice before rescoring, i.e., based on acoustic and language model scores computed by a state-of-the-art speech recognizer. The summer's effort will be considered a success if we are able to augment the acoustic model scores in a way that reduces the word error rate of the maximum-likelihood path after rescoring.
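The success criterion can be sketched as a toy rescoring step: each lattice hypothesis carries baseline acoustic and language-model log scores, an augmentation term (standing in for the landmark-based score) is added, and the best path is re-selected. The field names, weights, and scores are assumptions for illustration only.

```python
def best_path(paths, landmark_weight=0.0):
    """Pick the maximum-score hypothesis from a (toy) lattice.

    With landmark_weight == 0 this is the baseline maximum-likelihood
    path; a positive weight folds in the landmark-based score.
    """
    return max(paths, key=lambda p: (p["acoustic"] + p["lm"]
                                     + landmark_weight * p["landmark"]))

# Two toy hypotheses with illustrative log scores.
hyps = [
    {"words": "see speech", "acoustic": -10.0, "lm": -2.0, "landmark": -5.0},
    {"words": "sea peach",  "acoustic": -11.0, "lm": -2.0, "landmark": -1.0},
]
```

Here the baseline selects the first hypothesis; folding in the landmark score flips the decision to the second. Flips of this kind, measured as word error rate of the rescored maximum-likelihood path, are exactly what the evaluation tracks.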
The proposed research builds on results published by the workshop participants. Juneja and Espy-Wilson have demonstrated that support vector machines trained to detect and classify acoustic phonetic landmarks achieve 80% correct recognition of the six English manner-class categories in TIMIT, using a total of only 160 trainable parameters. For comparison, a manner-class recognizer consisting of six HMMs, each with three 13-mixture states observing a 48-dimensional vector (22716 parameters in total), achieves a manner-class recognition accuracy of 74%. Livescu and Glass have demonstrated improved phoneme recognition in TIMIT using a distinctive-feature-based pronunciation model consisting of five streams per word. Lattice rescoring has been demonstrated in an oracle experiment by Hasegawa-Johnson. Using Greenberg's transcriptions of the WS97 Switchboard sub-corpus, his experiment demonstrated that perfect knowledge of both manner and place distinctive features is sufficient for a 12% relative word error rate reduction in the maximum-likelihood path through a set of recognition lattices. The proposed effort will integrate these existing methods using new training scripts and lattice rescoring programs, and will test these ideas for improving word error rate.
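The HMM baseline's parameter count quoted above can be reproduced with a quick tally, assuming diagonal-covariance Gaussian mixtures (one mean, one variance, and one weight per component); attributing the leftover 18 parameters to transitions (3 per HMM) is our guess, not stated in the text.

```python
# 6 manner-class HMMs, 3 states each, 13-mixture GMMs over 48-dim vectors
dim, mixtures, states, models = 48, 13, 3, 6

per_mixture = dim + dim + 1              # mean + diagonal variance + weight = 97
per_state = mixtures * per_mixture       # 1261 parameters per state
gmm_total = models * states * per_state  # 22698 GMM parameters
transitions = 22716 - gmm_total          # 18 remaining, i.e., 3 per HMM
```

The contrast the paragraph draws is thus roughly 160 trainable SVM parameters against about 22.7 thousand HMM parameters for a 6-point accuracy gap.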
Landmark-Based Speech Recognition: Report of the Working Group, Mark Hasegawa-Johnson
Distinctive feature detection and landmark-based rescoring, Amit Juneja
Feature/Landmark-based Pronunciation Modeling using Dynamic Bayesian Networks, Karen Livescu
Discriminative Rescoring using Landmarks, Katrin Kirchhoff
Maximum Entropy Techniques for min-WER Score Combination with Sausages, Kemal Sonmez
Beyond Landmark-Based Speech Recognition, Steven Greenberg
Automatic Identification and Classification of Words using Phonetic and Prosodic Features, Srividya Mohan
Pronunciation variability, Emily Coogan
Glottalization and Vowel Nasalization Detection, Tianyu Wang
Jim Baker, Carnegie Mellon University
Mark Hasegawa-Johnson, University of Illinois
Katrin Kirchhoff, University of Washington
Jennifer Muller, Department of Defense
Sarah Borys, University of Illinois
Ken Chen, University of Illinois
Amit Juneja, University of Maryland
Emily Coogan, University of Illinois
Tianyu Wang, Georgia Tech