Landmark Based Speech Recognition

Research Group of the 2004 Summer Workshop

We seek to bring together new ideas from linguistics (especially nonlinear phonology) with new ideas from artificial intelligence (especially graphical models and support vector machines) in order to better match human speech recognition performance. Specifically, we will focus on two aspects of human speech communication that are not well modeled by current ASR:

(1) Asynchrony between manner of articulation (the type of a sound, e.g., stop, fricative, vowel) and place of articulation (the shape of the lips and tongue when a sound is produced, e.g., lip rounding, position of the tongue tip). Asynchrony will be modeled by creating a dictionary of dynamic Bayesian networks, one for each word in the English language, each designed to learn word-dependent synchronization between manner and place. Specifically, the pronunciation model will consist of a graphical model of each word, with separate hidden state-streams representing the distinctive features of manner, place, and voicing. Arcs in the graphical model will explicitly represent learnable approximate synchronization relationships between the different distinctive feature tiers.

(2) Extra attention to acoustic phonetic landmarks: consonant releases, consonant closures, and syllable nuclei. During the lip closure of a [p], there is no sound: in order to determine that the stop is a [p], a human listener must pay special attention to the 50 ms immediately before stop closure and immediately after release. Current ASR systems pay attention uniformly to the signal at all times. This workshop will develop discriminative classifiers (support vector machines) that detect and classify perceptually important acoustic phonetic landmarks. Methods will be developed to integrate the discriminant computation of the support vector machine with the generative probabilistic framework of the graphical models. Specifically, one of the methods we propose to test is a set of low-dimensional Gaussian likelihood functions, synchronized with the landmarks specified by a graphical model, and observing the discriminant output scores of the support vector machine.
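The two aspects above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the workshop's implementation: the function names, the hard asynchrony bound `max_async`, and the example score values are all hypothetical. The first part enumerates the joint states that two feature streams (manner and place) may occupy when they are allowed to drift apart by a bounded number of targets. The second part fits one-dimensional Gaussian likelihoods to made-up SVM discriminant scores, which is one simple way a discriminant output could be fed into a generative model.

```python
import math
from itertools import product

# --- Aspect (1): bounded asynchrony between two feature streams ---
# Assumption: each stream steps through the same number of targets, and the
# streams may differ in position by at most max_async targets.
def allowed_joint_states(n_targets, max_async):
    """Return (manner_index, place_index) pairs within the asynchrony bound."""
    return [(i, j) for i, j in product(range(n_targets), repeat=2)
            if abs(i - j) <= max_async]

# --- Aspect (2): Gaussian likelihoods over SVM discriminant scores ---
def fit_gaussian(scores):
    """Maximum-likelihood mean and variance of 1-D scores (variance floored)."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, max(var, 1e-6)

def gaussian_loglik(x, mean, var):
    """Log density of a univariate Gaussian at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

# Hypothetical discriminant scores from a landmark detector, grouped by
# whether a landmark was truly present at the frame (invented values).
landmark_scores = [1.2, 0.8, 1.5, 0.9]
non_landmark_scores = [-1.1, -0.7, -1.4, -0.6]
landmark_model = fit_gaussian(landmark_scores)
non_landmark_model = fit_gaussian(non_landmark_scores)

def log_likelihood_ratio(score):
    """Positive values favor 'landmark present' under the two fitted Gaussians."""
    return (gaussian_loglik(score, *landmark_model)
            - gaussian_loglik(score, *non_landmark_model))
```

With `max_async=0` the streams are forced to stay synchronized, so only the diagonal states remain; raising the bound admits more joint states. A learnable model would replace the hard bound with probabilities over the degree of asynchrony.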

Evaluation experiments will use the proposed system to rescore lattices generated by a state-of-the-art speech recognizer on the Switchboard test corpus. The baseline for comparison will be the maximum-likelihood path through the lattice before rescoring, i.e., based on acoustic and language model scores computed by a state-of-the-art speech recognizer. The summer's effort will be considered a success if we are able to augment the acoustic model scores in a way that reduces the word error rate of the maximum-likelihood path after rescoring.
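As a rough illustration of what rescoring a lattice means, the sketch below finds the best-scoring path through a tiny word lattice before and after an extra score is added to each edge. Everything here is assumed for the example: the lattice, the log scores, the `weight` parameter, and the function name `best_path` are hypothetical, and real lattices and score combinations are considerably more involved.

```python
def best_path(edges, n_nodes, weight=0.0):
    """Best path from node 0 to node n_nodes - 1 in a lattice.

    edges: list of (src, dst, word, base_logscore, extra_logscore) with
    src < dst, so node numbers are already in topological order.
    weight: scale applied to the extra (e.g. landmark-based) score;
    weight=0.0 reproduces the baseline path.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * n_nodes
    back = [None] * n_nodes
    best[0] = 0.0
    # Processing edges by increasing source node guarantees every node's best
    # score is final before its outgoing edges are used.
    for src, dst, word, base, extra in sorted(edges, key=lambda e: e[0]):
        if best[src] == neg_inf:
            continue
        score = best[src] + base + weight * extra
        if score > best[dst]:
            best[dst] = score
            back[dst] = (src, word)
    # Trace back from the final node to recover the word sequence.
    words = []
    node = n_nodes - 1
    while back[node] is not None:
        node, word = back[node]
        words.append(word)
    return list(reversed(words)), best[n_nodes - 1]

# Toy lattice with invented log scores (base = acoustic + language model).
toy_lattice = [
    (0, 1, "recognize", -2.0, -0.2),
    (0, 1, "wreck a nice", -1.8, -1.5),
    (1, 2, "speech", -1.0, -0.1),
    (1, 2, "beach", -0.9, -1.2),
]

baseline_words, _ = best_path(toy_lattice, n_nodes=3, weight=0.0)
rescored_words, _ = best_path(toy_lattice, n_nodes=3, weight=1.0)
```

In this toy case the baseline path is "wreck a nice beach" and the rescored path is "recognize speech"; the success criterion described above corresponds to the rescored path having a lower word error rate than the baseline path.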

The proposed research builds on results published by the workshop participants. Juneja and Espy-Wilson have demonstrated that support vector machines trained to detect and classify acoustic phonetic landmarks achieve 80% correct recognition in TIMIT of the six English manner class categories, using a total of only 160 trainable parameters. For comparison, a manner class recognizer consisting of six HMMs, each with three 13-mixture states observing a 48-dimensional vector (total: 22716 parameters) achieves manner class recognition accuracy of 74%. Livescu and Glass have demonstrated improved phoneme recognition in TIMIT using a distinctive feature based pronunciation model consisting of five streams per word. Lattice rescoring has been demonstrated in an oracle experiment by Hasegawa-Johnson. Using Greenberg’s transcriptions of the WS97 Switchboard sub-corpus, his experiment demonstrated that perfect knowledge of both manner and place distinctive features is sufficient for a 12% relative word error rate reduction in the maximum-likelihood path through a set of recognition lattices. The proposed effort will integrate these existing methods using new training scripts and lattice rescoring programs, and will test these ideas for improving word error rate.
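The HMM parameter total quoted above can be reproduced with simple arithmetic. The breakdown below is an assumption, since the text gives only the total: it supposes diagonal-covariance Gaussians (one mean and one variance per dimension), one weight per mixture component, and one additional parameter per state such as a transition probability.

```python
# Assumed decomposition of the 22716-parameter HMM manner-class recognizer.
n_models = 6        # one HMM per manner class (from the text)
n_states = 3        # states per HMM (from the text)
n_mixtures = 13     # mixture components per state (from the text)
dim = 48            # observation vector dimension (from the text)

# Assumption: diagonal covariance, so each component has dim means,
# dim variances, and one mixture weight.
params_per_component = dim + dim + 1

gmm_params = n_models * n_states * n_mixtures * params_per_component

# Assumption: one extra parameter per state (e.g. a transition probability).
per_state_params = n_models * n_states

total_params = gmm_params + per_state_params
ratio_to_svm = total_params / 160   # 160 SVM parameters, from the text
```

Under these assumptions the total matches the 22716 stated in the text, roughly 140 times the 160 trainable parameters reported for the support vector machine system.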

Final Report

Team Members
Senior Members
Jim Baker	Carnegie Mellon University
Steve Greenberg	Berkeley
Mark Hasegawa-Johnson	University of Illinois
Katrin Kirchhoff	University of Washington
Jennifer Muller	Department of Defense
Kemal Sonmez	SRI
Graduate Students
Sarah Borys	University of Illinois
Ken Chen	University of Illinois
Amit Juneja	University of Maryland
Karen Livescu	MIT
Vidya Mohan	JHU
Undergraduate Students
Emily Coogan	University of Illinois
Tianyu Wang	Georgia Tech

Center for Language and Speech Processing