Acoustic feature vectors typically represent short-term characteristics of the speech signal. Standard HMM-based systems classify over this short time span under the assumption that the short-term acoustic vectors are independent. There is growing evidence that the peripheral human auditory system can effectively integrate rather large time-spans (around 200 ms) of the audio signal.
The following experiments, conducted during WS96 on a subset of the Switchboard data, indicate that time-spans of about 200-300 ms of the signal are important in coding the linguistic information in speech:
The cross-correlation matrix $C_t(T)$ between the time trajectories (trajectory length $T$ ms) of $N$ critical-band energies at time $t$ was computed. The average cross correlation at time $t$ was then computed as

$$c_t(T) = \frac{o^{\top} C_t(T)\, o}{N^{2}},$$

where $o$ is a vector of ones. This operation computes the average of all the elements of the matrix and represents the average cross correlation between the various critical-band time trajectories at time $t$.
The average cross correlation for time-span $T$ is given by

$$\bar{c}(T) = \frac{1}{K} \sum_{t} c_t(T),$$

where $K$ is the number of time instants $t$ over which the average is taken.
It was observed (Fig.1) that $\bar{c}(T)$ was highest when using around 200 ms of speech data (i.e. $T = 200$). This indicates an underlying rhythmically organized sequence of articulatory movements with a period of about 200 ms.
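The averaging described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code used at WS96: the function names are hypothetical, the span is measured in frames rather than milliseconds (at an assumed 10 ms frame step, 20 frames is roughly 200 ms), and the cross-correlation matrix is taken to be the matrix of normalized correlation coefficients between band trajectories.

```python
import numpy as np

def average_cross_correlation(energies, t, span):
    """Average of all elements of the N x N correlation matrix at frame t.

    energies: array of shape (N, num_frames), one row per critical band
              (hypothetical input layout).
    span:     trajectory length in frames; the window is centered on t
              and holds 2 * (span // 2) + 1 frames.
    """
    half = span // 2
    segment = energies[:, t - half : t + half + 1]
    # Assumption: "cross correlation" means the normalized correlation
    # coefficient between each pair of band trajectories.
    corr = np.corrcoef(segment)
    ones = np.ones(corr.shape[0])
    # ones^T C ones sums every element; dividing by N^2 gives the mean.
    return float(ones @ corr @ ones / corr.shape[0] ** 2)

def span_average(energies, span):
    """Mean of the per-frame average over every frame with a full window."""
    half = span // 2
    centers = range(half, energies.shape[1] - half)
    return float(np.mean([average_cross_correlation(energies, t, span)
                          for t in centers]))
```

Calling `span_average` for a range of spans and plotting the result against the span would give a curve of the kind the text describes. A band whose energy is constant over a window would make the correlation coefficient undefined, so real data may need a guard for that case.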
Linear discriminant analysis (LDA) was performed on the hand-labeled subset of the Switchboard data. The vector space for the LDA was constructed from segments of time trajectories of a single speech feature over a relatively long span of time. It was observed that the first few discriminant vectors resulting from the LDA effectively perform FIR filtering of the trajectories, with a dominant weighting of evidence from about 200-300 ms of data centered around the current time instant. This indicates that the linear separability between phonetic classes in the presence of linear distortions could be improved by weighting the evidence from about 200-300 ms of the speech signal.
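To make the link between discriminant vectors and FIR filtering concrete, here is a minimal sketch of LDA on trajectory segments. It is an illustration only: the function names, the input layout, and the Cholesky-based solver are assumptions, not a description of the WS96 setup. Each row of `X` is one segment of a single feature's time trajectory centered on a labeled frame, and `y` holds the phonetic class label of that frame.

```python
import numpy as np

def lda_vectors(X, y):
    """Discriminant vectors for segments X (num_segments, segment_len).

    Solves Sb w = lambda Sw w, where Sw and Sb are the within-class and
    between-class scatter matrices. Returns eigenvalues in descending
    order and the matching discriminant vectors as columns.
    """
    dim = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((dim, dim))
    Sb = np.zeros((dim, dim))
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Whiten with the Cholesky factor of Sw so that the generalized
    # problem becomes an ordinary symmetric eigenproblem.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sw))
    eigvals, V = np.linalg.eigh(L_inv @ Sb @ L_inv.T)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], L_inv.T @ V[:, order]

def apply_as_fir(trajectory, w):
    """Slide w along the trajectory; each output sample is the
    projection of one segment onto w, i.e. an FIR-filtered value."""
    return np.correlate(trajectory, w, mode="valid")
```

Projecting every segment of a trajectory onto one discriminant vector is the same operation as running a fixed set of weights along the trajectory, which is why the vectors can be read as FIR filter impulse responses. The sketch assumes enough segments per class for the within-class scatter matrix to be positive definite.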
The frequency response (Fig.2) and the impulse response (Fig.3) of the first discriminant vector are consistent with the current RASTA filter. The second discriminant vector exhibits a frequency response with two resonant peaks, the higher peak at about 10 Hz, and the third discriminant vector effectively differentiates the trajectory (i.e. computes delta features).
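A discriminant vector can be inspected as a filter by treating its elements as FIR taps and taking a Fourier transform. The sketch below assumes a frame rate of 100 Hz (a 10 ms frame step), which the text does not state, and uses a generic 5-point regression-style delta filter as a stand-in for a differentiating vector; it is not one of the vectors obtained in the experiment.

```python
import numpy as np

def magnitude_response(taps, frame_rate_hz=100.0, n_fft=512):
    """Magnitude response of an FIR filter applied to a feature trajectory.

    frame_rate_hz is an assumed sampling rate of the trajectory; the
    returned frequencies are modulation frequencies in Hz.
    """
    spectrum = np.fft.rfft(taps, n=n_fft)  # zero-padded transform
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / frame_rate_hz)
    return freqs, np.abs(spectrum)

# Stand-in differentiator: a 5-point regression-style delta filter.
delta_taps = np.array([-2.0, -1.0, 0.0, 1.0, 2.0]) / 10.0
freqs, mag = magnitude_response(delta_taps)
```

For this stand-in filter the gain is zero at 0 Hz and grows roughly linearly with frequency at low modulation frequencies, which is the signature of differentiation. The same function could be applied to any discriminant vector to locate its pass-band or resonant peaks on a modulation-frequency axis.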
Last modified on October 16, 1996