| Workshop 2004 | Saturday, November 7, 2009 |
Stevens proposed improving the accuracy and noise robustness of speech recognition by eliminating frame-based observation PDFs and instead focusing the attention of the classifier on instantaneous phonetic landmarks. Phonetic landmarks are articulatory events with acoustic signatures that are easily detected at low SNR, including syllabic nuclei, intervocalic glides, and onsets and offsets of the distinctive features [sonorant] and [continuant]. Stevens proposed that acoustic observations should be optimized separately for each landmark type, e.g., measures of signal periodicity sampled once per 10 ms for the detection of syllable nuclei, but energies sampled once per 1 ms for the detection of stops.
Following Stevens' proposal, Niyogi, Ramesh, and Burges used kernel-based SVMs to detect stop consonants in TIMIT: http://www.cs.uchicago.edu/research/publications/techreports/TR-2002-02 . Briefly, an SVM is just a feedforward artificial neural network (ANN) with an unusual training algorithm. Specifically, instead of being trained to minimize training corpus error (as are most ANNs), an SVM is trained to minimize an upper bound on the expected value of the test corpus error. This upper bound includes two terms: (1) the training corpus error, and (2) the magnitude of the network output layer weight vector. The upper bound holds only if weights on the input layer are not trained in the usual sense; instead, the hidden nodes are actual examples drawn from the training corpus. The training algorithm chooses the most confusable examples: the so-called "support vectors."
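The network view of an SVM described above can be sketched in a few lines. This is a minimal illustration, not the detector used by Niyogi, Ramesh, and Burges: the Gaussian (RBF) kernel, the function names, and all numeric values are assumptions chosen for the example. Each support vector plays the role of one hidden node, and the product of its multiplier and label is the corresponding output-layer weight.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors.
    The kernel choice is an assumption made for this sketch."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svm_output(x, support_vectors, labels, alphas, bias, gamma=1.0):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(s_i, x) + b.

    Hidden layer: one node per support vector s_i, activated by K(s_i, x).
    Output layer: weight alpha_i * y_i on node i, plus a bias term.
    """
    hidden = [rbf_kernel(s, x, gamma) for s in support_vectors]
    return sum(a * y * h for a, y, h in zip(alphas, labels, hidden)) + bias

# Toy, hypothetical values: two support vectors, one per class.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, +1]      # class labels y_i
alphas = [1.0, 1.0]    # multipliers that training would normally choose
bias = 0.0

score_near_positive = svm_output((1.9, 2.1), support_vectors, labels, alphas, bias)
score_near_negative = svm_output((0.1, -0.1), support_vectors, labels, alphas, bias)
```

The sign of the output gives the class decision. Training, which selects the support vectors and their multipliers, is deliberately left out of the sketch.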
A landmark is intended to be (1) reliably identifiable in a variety of noise environments, (2) a good reference time for measurements of spectral dynamics. Acoustic phonetic studies use a number of explicitly time-sensitive measurements that are hard to use in speech recognition, because accurate landmark times have not been previously available to the speech recognizer: examples include burst spectra, formant transitions, and voice onset times (Niyogi and Ramesh, 2003, Speech Communication). Hasegawa-Johnson [link] measured the mutual information between phoneme label and spectral energy, as a function of time and frequency, with time measured relative to a landmark. The resulting plots ("infograms," shown here: [link]) show high mutual information in precisely the locations predicted by acoustic phonetics: burst spectrum, formant transitions, voice bar, frication peak. Omar and Hasegawa-Johnson proposed modeling landmarks, in an HMM-based phoneme recognizer, using a non-recurrent HMM boundary state (ASRU 2001: [link]). The extra "landmark state" improved recognition slightly but significantly.
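An infogram-style computation can be sketched as follows, assuming the spectral energies have already been quantized into discrete bins and indexed by (time offset from the landmark, frequency band). The estimator used in the original study is not described here, so the plug-in estimate below, the function names, and the toy data are all illustrative assumptions.

```python
import math
from collections import Counter

def mutual_information(labels, observations):
    """Plug-in estimate of I(label; observation) in bits from paired discrete samples."""
    n = len(labels)
    joint = Counter(zip(labels, observations))
    label_counts = Counter(labels)
    obs_counts = Counter(observations)
    mi = 0.0
    for (lab, obs), count in joint.items():
        # p(l, o) / (p(l) p(o)) simplifies to count * n / (count_l * count_o)
        ratio = count * n / (label_counts[lab] * obs_counts[obs])
        mi += (count / n) * math.log2(ratio)
    return mi

def infogram(labels, cells):
    """Map each (time_offset, band) cell to its mutual information with the labels."""
    return {cell: mutual_information(labels, obs) for cell, obs in cells.items()}

# Hypothetical toy data: four tokens, two phoneme labels, two time-frequency cells.
phoneme_labels = ["b", "b", "s", "s"]
cells = {
    (-10, "low"): ["hi", "hi", "lo", "lo"],   # energy bin tracks the label exactly
    (+50, "high"): ["hi", "lo", "hi", "lo"],  # energy bin is independent of the label
}
info = infogram(phoneme_labels, cells)
```

In this toy case the first cell carries one full bit about the label and the second carries none, which is the kind of contrast an infogram makes visible across time and frequency.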
Juneja and Espy-Wilson developed SVM-based landmark detectors for onsets and offsets of the distinctive features [silence] (94% recognition accuracy), [syllabic] (79% accuracy), [sonorant] (93%), and [continuant] (94%). Six-manner-class recognition accuracy on TIMIT was 80%, using a total of 160 trainable parameters. See Juneja's thesis proposal and papers at http://www.glue.umd.edu/~juneja/ . These results may be compared to the performance of a six-class mixture Gaussian HMM trained by Borys and Hasegawa-Johnson for the same task; with 22716 trainable parameters, the best tested HMM architecture achieved a six-class recognition accuracy of up to 74%.
In preparation for this workshop, Borys and Hasegawa-Johnson conducted an oracle experiment to test the feasibility of landmark-based speech recognition for lattice rescoring. State-of-the-art recognition lattices for this dataset were unavailable, because the WS97 test set has been folded into the standard Switchboard training set. Lacking state-of-the-art recognition models, we trained 3-mixture monophone models in HRest using the (small) WS97 training corpus. 500-best word lattices were then computed for the WS97 test corpus. The resulting lattices were far below the state of the art (80% WER!), but gave us a benchmark for experimentation. An ideal "landmark-based speech recognizer" was simulated by converting Steve Greenberg's ICSI Switchboard phoneme transcriptions (http://www.icsi.berkeley.edu/real/stp/index.html) into presumably error-free landmark-based distinctive feature transcriptions. Relmax-trimmed landmark-based dictionaries were created using Fosler-Lussier's babylex program (see his dissertation at http://www.cis.ohio-state.edu/~fosler/publications.html). Word lattices were rescored by penalizing every difference between a word's best-matching dictionary entry and the manual transcription. Rescoring in this way was entirely ineffective until lattices were pinched to the maximum likelihood (ML) transcription using methods presented by Byrne at Eurospeech 2003 (and other places). After lattices were pinched to the ML transcription, rescoring resulted in a WER improvement. Using only manner features, the word error rate of the maximum likelihood path through the lattices dropped by 2% relative (2% absolute). Using all distinctive features, word error rate dropped by almost 12% relative (9% absolute).
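The rescoring rule described above, penalizing every difference between a word's best-matching dictionary entry and the manual transcription, might look roughly like the following. This is a sketch under several assumptions that the text does not confirm: differences are counted by edit distance over landmark symbols, each lattice arc already carries the slice of the transcription time-aligned to it, and the penalty is subtracted from a log score. The names `rescore_arc`, `dictionary`, and `penalty`, and the toy entries, are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of landmark symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution or match
        prev = curr
    return prev[-1]

def rescore_arc(word, log_score, observed, dictionary, penalty=1.0):
    """Subtract a penalty for each mismatch against the word's best-matching entry."""
    best_mismatch = min(edit_distance(pron, observed) for pron in dictionary[word])
    return log_score - penalty * best_mismatch

# Hypothetical landmark dictionary: each word maps to one or more feature sequences.
dictionary = {
    "two": [("-continuant", "+syllabic")],
    "sue": [("+continuant", "+syllabic")],
}
# Landmarks from the (assumed error-free) manual transcription for this arc's time span.
observed = ("+continuant", "+syllabic")

score_two = rescore_arc("two", -1.0, observed, dictionary)
score_sue = rescore_arc("sue", -1.5, observed, dictionary)
```

Here the first-pass score prefers "two", but its single feature mismatch costs one penalty unit, so "sue" ranks higher after rescoring. Lattice pinching is not modeled in this sketch.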
The Center for Language and Speech Processing, The Johns Hopkins University, 3400 North Charles Street, Barton Hall, Baltimore, MD 21218
Telephone: (410) 516-4237 | Fax: (410) 516-5050 | E-mail: clsp@clsp.jhu.edu