A first issue in the design of the system is to determine the number and position of the frequency bands and to choose the acoustic parameters. Once these are determined, the approach fundamentally consists of combining the outputs of multiple recognizers. Each sub-recognizer has its own acoustic model and can generate its own temporal alignment inside the pre-defined lexical sub-units (words, syllables, phones, ...).
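As a rough illustration of the band-definition step, the sketch below splits one frame of filterbank energies into contiguous sub-band vectors, each of which would feed its own sub-recognizer. The function name `split_into_subbands`, the 15-channel frame, and the band edges are hypothetical choices made for the example; the actual number and position of the bands is precisely the design question raised above.

```python
from typing import List, Sequence


def split_into_subbands(frame: Sequence[float],
                        edges: Sequence[int]) -> List[List[float]]:
    """Split one frame of filterbank energies into contiguous sub-bands.

    `edges` lists channel indices marking the band boundaries
    (hypothetical layout): [0, 5, 10, 15] gives three sub-bands
    covering channels 0-4, 5-9 and 10-14.
    """
    if len(edges) < 2 or list(edges) != sorted(set(edges)):
        raise ValueError("edges must be strictly increasing, with >= 2 entries")
    if edges[0] != 0 or edges[-1] != len(frame):
        raise ValueError("edges must cover the whole frame")
    # Each consecutive pair of edges delimits one sub-band.
    return [list(frame[lo:hi]) for lo, hi in zip(edges, edges[1:])]


# Illustrative 15-channel frame; real values would come from the front end.
frame = [float(i) for i in range(15)]
bands = split_into_subbands(frame, [0, 5, 10, 15])
```

Each element of `bands` can then be turned into the acoustic parameters of one sub-recognizer, independently of the other bands.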
Of course, a sub-band carries less information than the whole band, so the partial decisions may be less reliable. To avoid too much flexibility in choosing the time-warping path, it is necessary to re-introduce some constraints at a higher level. This is done by forcing synchrony of the different independent frequency band recognizers at some level, as shown in this figure.
We note that, while recombination at the state level is quite easy, it is no longer straightforward at any higher sub-word unit level if one simply uses the standard one-pass dynamic programming approach. Rather, the system can use either an approach based on two-level dynamic programming, or an adaptation of HMM decomposition [5], which can be used to do multi-dimensional time-warping and recombination of the frequency sub-bands.
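To make the state-level case concrete, here is a minimal sketch in which the per-band state log-likelihoods are merged frame by frame and the result is decoded with an ordinary Viterbi pass. The weighted sum of log-likelihoods is an assumption made for the example (the text does not fix a recombination rule), and all names, weights, and probabilities are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

NEG_INF = float("-inf")


def combine_state_scores(band_loglik: Sequence[Sequence[Sequence[float]]],
                         weights: Sequence[float]) -> List[List[float]]:
    """Merge band_loglik[b][t][s] over bands b with a weighted sum (assumed rule)."""
    n_frames, n_states = len(band_loglik[0]), len(band_loglik[0][0])
    return [[sum(w * band[t][s] for w, band in zip(weights, band_loglik))
             for s in range(n_states)] for t in range(n_frames)]


def viterbi(obs: Sequence[Sequence[float]],
            log_trans: Sequence[Sequence[float]],
            log_init: Sequence[float]) -> Tuple[List[int], float]:
    """Standard Viterbi decoding in the log domain; returns (path, score)."""
    n_states = len(log_init)
    delta = [log_init[s] + obs[0][s] for s in range(n_states)]
    backptrs = []
    for t in range(1, len(obs)):
        new_delta, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            prev = max(range(n_states), key=lambda p: delta[p] + log_trans[p][s])
            new_delta.append(delta[prev] + log_trans[prev][s] + obs[t][s])
            ptr.append(prev)
        delta = new_delta
        backptrs.append(ptr)
    last = max(range(n_states), key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backptrs):  # trace the best path backwards
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]


def logs(rows):
    return [[math.log(p) for p in row] for row in rows]


# Two bands, three frames, two states (made-up probabilities).
band0 = logs([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]])
band1 = logs([[0.9, 0.1], [0.3, 0.7], [0.1, 0.9]])
log_trans = [[math.log(0.6), math.log(0.4)], [NEG_INF, 0.0]]  # left-to-right
log_init = [0.0, NEG_INF]

combined = combine_state_scores([band0, band1], [0.5, 0.5])
joint_path, joint_score = viterbi(combined, log_trans, log_init)
band0_path, _ = viterbi(band0, log_trans, log_init)
```

In this toy setup, band 0 decoded alone stays in the first state for two frames, while the recombined scores move to the second state one frame earlier. Because the merge happens at every frame, both bands are forced onto the same state sequence, which is why a single one-pass search is enough at this level.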
A draft version of a theoretical discussion of the multi-band paradigm, and more generally of the temporal integration of multi-stream inputs, is available in:
Approach
In this section we present the principles of our multiband system.
References
[1] H. Fletcher, "Speech and Hearing in Communication", New York: Krieger, 1953.
[2] J. Allen, "How do humans process and recognize speech?", IEEE Trans. on Speech and Audio Processing, vol. 2, no. 4, pp. 567-577, 1994.
[3] P. Duchnowski, "A New Structure for Automatic Speech Recognition", PhD thesis, MIT, Sept. 1993.
[4] N. Morgan, C. Wooters and H. Hermansky, "Experiments with temporal resolution for continuous speech recognition with multi-layer perceptrons", in Proc. of IEEE Workshop on Neural Networks for Signal Processing, pp. 405-410, 1991.
[5] A. Varga and R. Moore, "Hidden Markov Model decomposition of speech and noise", in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, pp. 845-848, 1990.
Last modified on October 16, 1996