New Directions in Robust Automatic Speech Recognition – Richard Stern (Carnegie Mellon University)
Abstract
As speech recognition technology is transferred from the laboratory to the marketplace, robustness in recognition is becoming increasingly important. This talk will review and discuss several classical and contemporary approaches to robust speech recognition. The most tractable types of environmental degradation are produced by quasi-stationary additive noise and quasi-stationary linear filtering. These distortions can be largely ameliorated by the “classical” techniques of cepstral high-pass filtering, as exemplified by cepstral mean normalization and RASTA filtering, as well as by techniques that develop statistical models of the distortion, such as codeword-dependent cepstral normalization and vector Taylor series expansion. Nevertheless, these approaches fail to provide much useful improvement in accuracy when speech is degraded by transient or non-stationary noise such as background music or speech. We describe and compare the effectiveness of techniques based on missing-feature compensation, multi-band analysis, feature combination, and physiologically-motivated auditory scene analysis in providing increased recognition accuracy in difficult acoustical environments.
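To make the first of these “classical” techniques concrete: quasi-stationary linear filtering multiplies the speech spectrum by a fixed channel response, which becomes an additive constant in the log-spectral (and hence cepstral) domain, so subtracting each utterance's long-term mean cepstrum largely cancels the channel. The following is a minimal pure-Python sketch of cepstral mean normalization; the function name and the toy feature values are illustrative, not real MFCCs or any particular toolkit's API.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean from each cepstral vector.

    A stationary linear channel adds a constant offset to every frame's
    cepstrum, so removing the mean over the utterance acts as a cepstral
    high-pass filter that suppresses that offset.
    """
    n = len(frames)
    dim = len(frames[0])
    # Per-coefficient mean across all frames of the utterance.
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

# Illustrative 3-frame, 2-coefficient "utterance"; any constant channel
# offset added to every frame would be removed identically.
frames = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
normalized = cepstral_mean_normalize(frames)
# After normalization, each coefficient has zero mean across the utterance.
```

In practice the same idea is applied per speaker or per session rather than per utterance, and RASTA filtering generalizes it from removing only the DC component to band-pass filtering the time trajectory of each cepstral coefficient.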