CLSP
WORKSHOP '96

The Multiband Paradigm



Introduction

Some experiments conducted by Fletcher [1][2] suggest that human auditory perception is based on decisions made within narrow frequency bands that are processed independently of each other. These sub-band decisions would be recombined at some intermediate level, in such a way that the global error rate equals the product of the error rates in the sub-bands. Other physiological experiments suggest that the temporal resolution of the human auditory system varies across frequency bands. Whether or not these statements are accurate for disparate bands in continuous speech (Fletcher's relevant experiments were done with nonsense syllables, using highpass or lowpass filters only), we see some good engineering reasons for considering some form of this sub-band approach:
  1. The message may be impaired (by noise or reverberation) only in some frequency bands. When recognition is based on several independent decisions from different sub-bands, the decoding of the linguistic message may not be severely impaired, as long as the remaining clean sub-bands supply sufficiently reliable information.
  2. Acoustic events, such as the transitions between more stationary segments of speech, do not necessarily occur at the same time across the different frequency bands, which makes the piecewise stationary assumption more fragile. The sub-band approach may have the potential advantage of relaxing the synchrony constraint inherent in current HMM systems.
  3. A better use of the temporal information and, more generally, a better time/frequency compromise can be obtained by independently optimizing each sub-recognizer in terms of temporal resolution and acoustic context.
  4. Different recognition strategies including the use of different acoustic parameters could be used in different frequency bands.
Preliminary work in this direction has recently been reported in [3] and [4]. Although the recombination scheme in [3] was quite simple, and no optimisation of the frequency bands was performed, this work yielded results quite similar to those of the conventional full-band recognizers used for comparison. However, this approach was not tested under conditions of narrowband noise degradation.
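
The product-of-errors rule attributed to Fletcher above can be illustrated with a few lines of Python. This is only a sketch of the arithmetic: the function name and the error rates are hypothetical values chosen for illustration, not measurements from any experiment.

```python
from math import prod


def fullband_error(subband_errors):
    """Product-of-errors rule: the global error rate is taken to be
    the product of the error rates of the individual sub-bands."""
    for e in subband_errors:
        if not 0.0 <= e <= 1.0:
            raise ValueError("error rates must lie in [0, 1]")
    return prod(subband_errors)


# Hypothetical error rates for three clean sub-bands (illustrative only).
print(fullband_error([0.5, 0.5, 0.25]))  # 0.0625

# Same bands, but the first one is fully corrupted (error rate 1.0):
# under this rule the global error depends only on the remaining bands.
print(fullband_error([1.0, 0.5, 0.25]))  # 0.125
```

Under this rule, a band that is completely unreliable contributes a factor of 1 and so cannot make the global error worse than that of the remaining bands, which is the intuition behind the first engineering argument in the list above.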

Approach

In this section we present the principles of our multiband system.

A first issue in the design of the system is to determine the number and position of the frequency bands and to choose the acoustic parameters. Once these are determined, the approach essentially consists of combining the outputs of multiple recognizers. Each sub-recognizer has its own acoustic model and can generate its own temporal alignment inside the pre-defined lexical sub-units (words, syllables, phones, ...).

Of course, there is less information in a sub-band than in the whole band; the partial decisions may thus be less reliable. To avoid too much flexibility in the choice of the time-warping path, it is necessary to reintroduce some constraints at a higher level. This is done by forcing the independent frequency-band recognizers to synchronize at some level, as shown in the figure.

We note that, while recombination at the state level is quite easy, it is no longer straightforward at any higher sub-word unit level if one simply uses the standard one-pass dynamic programming approach. Rather, the system can use either an approach based on two-level dynamic programming (time-warping) or an adaptation of HMM decomposition [5], which can be used to perform multi-dimensional time-warping and recombination of the frequency sub-bands.
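
To make the "easy" state-level case concrete, here is a minimal sketch in Python. It assumes that each sub-band recognizer delivers a log-likelihood for every state at every frame, and that the bands are recombined by a weighted sum of these log-likelihoods before a standard one-pass Viterbi search. The weighted-sum rule, the function names, the weights, and the toy scores are all assumptions made for illustration; they are not taken from the system described here.

```python
import math


def combine_state_scores(band_scores, weights=None):
    """Recombine sub-bands at the state level.

    band_scores[b][t][s] is the log-likelihood of state s at frame t
    according to sub-band b. Assumption: scores are merged by a weighted
    sum of logs, i.e. the bands are treated as independent.
    """
    if weights is None:
        weights = [1.0] * len(band_scores)
    n_frames = len(band_scores[0])
    n_states = len(band_scores[0][0])
    return [
        [sum(w * band[t][s] for w, band in zip(weights, band_scores))
         for s in range(n_states)]
        for t in range(n_frames)
    ]


def viterbi(log_emit, log_trans, log_init):
    """Standard one-pass Viterbi search; returns the best state sequence."""
    n_states = len(log_init)
    delta = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptrs = []
    for t in range(1, len(log_emit)):
        new_delta, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: delta[p] + log_trans[p][s])
            ptr.append(best_prev)
            new_delta.append(delta[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
        delta = new_delta
        backptrs.append(ptr)
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Toy example (hypothetical scores): two states, three frames, two bands.
log = math.log
clean_band = [[log(0.9), log(0.1)], [log(0.8), log(0.2)], [log(0.2), log(0.8)]]
noisy_band = [[log(0.5), log(0.5)]] * 3   # uninformative, e.g. masked by noise
combined = combine_state_scores([clean_band, noisy_band])
uniform = [[log(0.5)] * 2] * 2
print(viterbi(combined, uniform, [log(0.5)] * 2))  # [0, 0, 1]
```

Because all bands share a single state sequence in this sketch, they are forced to be synchronous at every frame. Relaxing that constraint up to the phone, syllable, or word level is what calls for the two-level or HMM-decomposition approaches mentioned above.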

A draft version of a theoretical discussion of the multi-band paradigm and, more generally, of the temporal integration of multi-stream inputs is available in:

References

[1] H. Fletcher, "Speech and Hearing in Communication", New York: Krieger, 1953.
[2] J. Allen, "How do humans process and recognize speech?", IEEE Trans. on Speech and Audio Processing, vol. 2, no. 4, pp. 567-577, 1994.
[3] P. Duchnowski, "A New Structure for Automatic Speech Recognition", PhD thesis, MIT, Sept. 1993.
[4] N. Morgan, C. Wooters and H. Hermansky, "Experiments with temporal resolution for continuous speech recognition with multi-layer perceptrons", in Proc. of IEEE Workshop on Neural Networks for Signal Processing, pp. 405-410, 1991.
[5] A. Varga and R. Moore, "Hidden Markov Model decomposition of speech and noise", in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, pp. 845-848, 1990.


Last modified on October 16, 1996
Christophe Ris <ris@cspjhu.ece.jhu.edu>