From "Resource Management" to "Call Home" - A Little Science, a Little Art, and Still a Long Way to Go

Andrej Ljolje, AT&T Labs - Research

April 21, 1998


Hidden Markov Models (HMMs) were well established in the late eighties, during the height of the Resource Management evaluations. They have been so successful that they form the basis of virtually all speech recognition systems today. In the following years, most of the research effort was devoted to speaker adaptation and to improving recognizer structure within the HMM framework (phoneme context dependency clustering, pronunciation modeling). Large improvements in performance have also been achieved on very small tasks (digits, spelled letters) using discriminative training to minimize empirical error rate, and using signal conditioning techniques. Additional small improvements were achieved using segmental duration modeling and explicit modeling of correlations, either across observation parameters or over time. With the advent of tasks such as Switchboard and Call Home, where the speech is collected in a more natural setting and where the word error rate was initially twice as high as the word accuracy, it was clear that more needed to be done than simply collecting more data. This resulted in widespread use of Vocal Tract Length Normalization (VTLN) and Speaker Adaptive Training (SAT).
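VTLN compensates for differences in speaker vocal tract length by warping the frequency axis before feature extraction. As an illustration only (the talk does not give details), a piecewise-linear warp of the kind commonly used might be sketched as follows; the function name, the 0.8 breakpoint, and the example values are assumptions, not material from the talk:

```python
def vtln_warp(f, alpha, f_max, break_ratio=0.8):
    """Illustrative VTLN sketch: warp frequency f (Hz) by factor alpha,
    keeping f_max fixed so the overall analysis bandwidth is preserved.

    Below an assumed breakpoint (break_ratio * f_max) the warp is the
    simple linear map alpha * f; above it, a second linear segment
    joins (f0, alpha * f0) to (f_max, f_max).
    """
    f0 = break_ratio * f_max  # breakpoint (0.8 * f_max is an assumption)
    if f <= f0:
        return alpha * f
    # Linear segment through (f0, alpha * f0) and (f_max, f_max),
    # continuous with the lower branch at f = f0.
    slope = (f_max - alpha * f0) / (f_max - f0)
    return f_max - slope * (f_max - f)

# A speaker-specific alpha (typically found by searching a small range
# around 1.0 for maximum likelihood) stretches or compresses the spectrum:
print(vtln_warp(1000.0, 0.9, 8000.0))  # 900.0
```

In practice such a warp is applied to the filterbank center frequencies rather than to each spectral bin, and the per-speaker warp factor is chosen by maximizing acoustic likelihood under the current models.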
Despite all the new acoustic modeling techniques, three observations dominate the current perception of the modeling field:

  1. A mismatch between the training speech and test speech (different microphone, spectral filtering, speaking style, noise, echo etc.) can cause drastic degradation in recognition performance;
  2. There is evidence that, on Named Entities, the transcription differences between human transcribers are much closer to automatic speech recognition performance than the differences on function words are. Acoustic modeling dominates in the case of Named Entities, and language/semantic modeling in the case of function words;
  3. Baseline performance using HMMs still determines the final recognition performance of different recognition systems, as additional techniques seem to improve performance consistently across systems, regardless of the baseline.

The science gave us the improvements from the new modeling techniques; the art still dominates the baseline performance; and we have a long way to go before we approach human robustness to environment changes and human use of syntactic and semantic knowledge in recognizing speech.


Dr. Andrej Ljolje grew up in Croatia. He was awarded a B.Sc. degree in Cybernetics and Control Engineering (with Mathematics) from the University of Reading, England, in 1982. His Ph.D. degree in Speech Processing was awarded by the University of Cambridge, England, in 1986. From 1985 to 1987 he was a Research Fellow at Trinity Hall, Cambridge. He then joined AT&T Bell Labs as a post-doc and remained there as a Member of Technical Staff until the trivestiture in 1996. Since then he has been with AT&T Labs as a Principal Technical Staff Member. His work has been primarily in acoustic modeling for tasks ranging from a few words to unlimited vocabularies.