Data-driven Speech Analysis for ASR: The Need for Syllable-length Context – Hynek Hermansky (Oregon Graduate Institute of Science and Technology)

February 2, 1999

A typical large-vocabulary automatic speech recognition (ASR) system consists of three main components: 1) feature extraction, 2) pattern classification, and 3) language modeling. Replacing hardwired prior knowledge in the pattern classification and language modeling modules with knowledge derived from data has turned out to be one of the most significant advances in ASR research of the past two decades. The speech analysis module, however, has so far resisted this data-oriented revolution and is typically built on textbook knowledge of speech production and perception. Our current research aims at extending the data-driven approach to speech analysis.

Since speech has been optimized by millennia of human evolution to serve its communicative purpose through an imperfect production-environment-perception channel, it carries imprints of that channel. It would therefore be gratifying, but would come as no surprise, if data-driven analysis yielded solutions consistent with the properties of human speech production and perception.

In the talk we first describe our efforts to apply the concept of mutual information to a relatively large, phoneme hand-labeled database of fluent speech in order to estimate the distribution, in the time-frequency plane, of the information most relevant for phoneme classification. We demonstrate that this information is distributed over a significant time interval around the given phoneme. Linear Discriminant Analysis is then used to derive optimized spectral basis functions and filters (replacing the conventional cosine bases of cepstral analysis and the conventional RASTA and delta filters used for deriving dynamic features) for processing the time-frequency plane of the speech signal. The last part of the talk describes our initial efforts to depart from the conventional assumption of the importance of across-spectrum features (such as the cepstrum) and to move towards frequency-localized classifiers of relatively long (about 1 s) temporal patterns of critical-band spectral energies.

Work supported by the Department of Defense and by the National Science Foundation.
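The first part of the talk rests on estimating the mutual information between individual time-frequency features (e.g. a critical-band energy at some frame offset) and phoneme labels. As a minimal illustrative sketch only (a simple histogram-based estimator, not the speakers' actual implementation; the function name and binning scheme are our own assumptions):

```python
import numpy as np

def mutual_information(feature, labels, n_bins=10):
    """Estimate I(X; Y) in bits between a scalar feature (e.g. the
    spectral energy at one point of the time-frequency plane) and
    discrete phoneme labels, via equal-count histogram binning."""
    feature = np.asarray(feature, dtype=float)
    # Discretize the continuous feature into roughly equal-count bins
    edges = np.quantile(feature, np.linspace(0.0, 1.0, n_bins + 1))
    x = np.clip(np.digitize(feature, edges[1:-1]), 0, n_bins - 1)
    classes, y = np.unique(labels, return_inverse=True)
    # Joint distribution over (feature bin, phoneme class)
    joint = np.zeros((n_bins, len(classes)))
    np.add.at(joint, (x, y), 1.0)
    joint /= joint.sum()
    # Marginals, then I(X;Y) = sum p(x,y) log2 p(x,y)/(p(x)p(y))
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))
```

A feature strongly dependent on the phoneme label yields a large estimate (up to log2 of the number of classes), while an unrelated feature yields an estimate near zero; computing this for every point of the time-frequency plane around a labeled phoneme maps out where the phoneme-relevant information lies.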

Center for Language and Speech Processing