CLSP
WORKSHOP '96

Syllable-Length Time-Spans


Use of syllable-length time-spans in deriving speech feature vectors

The motivation

Acoustic feature vectors typically represent short-term characteristics of the speech signal. Standard HMM-based systems classify over this short time span under the assumption that the short-term acoustic vectors are independent. There is growing evidence that the peripheral human auditory system can effectively integrate rather large time-spans (around 200 ms) of the audio signal.

The following experiments conducted during WS96 on a subset of the Switchboard data indicate that time-spans of about 200-300 ms duration of the signal are important in coding the linguistic information in speech:

The cross-correlation evidence

The cross-correlation matrix $C_T(t)$ between the time trajectories (length of trajectory $T$ msec) of $N$ critical-band energies at time $t$ was computed. The average cross-correlation at time $t$ was then computed as:

$$\bar{c}_T(t) = \frac{1}{N^2}\, o^{\top} C_T(t)\, o$$

where $o$ is a vector of $N$ ones. This operation computes the average of all the elements of the matrix and represents the average cross-correlation between the critical-band time trajectories at time $t$.

The average cross-correlation for time-span $T$ is given by

$$\bar{c}_T = \langle \bar{c}_T(t) \rangle_t$$

that is, the mean of $\bar{c}_T(t)$ over all time instants $t$.

It was observed (Fig. 1) that the average cross-correlation was highest when using around 200 ms of speech data (i.e. a trajectory length of about 200 msec). This indicates an underlying rhythmically organized sequence of articulatory movements with a period of about 200 msec.
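The averaging described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code used at WS96: it assumes a normalized (Pearson) correlation matrix, a 10 ms frame step, and random data standing in for critical-band energies. The function and variable names are hypothetical.

```python
import numpy as np

def average_cross_correlation(band_energies, t, span_frames):
    """Mean of all elements of the N x N correlation matrix between the
    band trajectories in a window of `span_frames` frames around frame t."""
    half = span_frames // 2
    start = t - half
    segment = band_energies[:, start:start + span_frames]  # N x span_frames
    corr = np.corrcoef(segment)          # rows are bands (assumed Pearson)
    n_bands = corr.shape[0]
    ones = np.ones(n_bands)              # the vector of ones, "o"
    return ones @ corr @ ones / n_bands ** 2

def span_average(band_energies, span_frames):
    """Average the per-frame value over every frame with a full window."""
    half = span_frames // 2
    n_frames = band_energies.shape[1]
    centers = range(half, n_frames - span_frames + half + 1)
    values = [average_cross_correlation(band_energies, t, span_frames)
              for t in centers]
    return float(np.mean(values))

# Hypothetical stand-in data: 15 bands, 500 frames of random "energies".
rng = np.random.default_rng(0)
energies = rng.standard_normal((15, 500))

for span_ms in (100, 200, 400):
    span_frames = span_ms // 10          # assumes a 10 ms frame step
    print(span_ms, span_average(energies, span_frames))
```

Sweeping the span length on real critical-band energies and plotting the result would give a curve of the kind shown in Fig. 1; the random data here carries no such structure.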


Figure 1: Cross-correlation curve


Linear discriminant analysis (LDA)

LDA was performed on the hand-labeled subset of the Switchboard data. The vector space for the LDA was constructed from segments of time trajectories of a single speech feature over a relatively long span of time. It was observed that the first few discriminant vectors resulting from LDA effectively perform FIR filtering of the trajectories, with a dominant weighting of evidence from about 200-300 ms of data centered around the current time instant. This indicates that the linear separability between phonetic classes in the presence of linear distortions could be improved by weighting the evidence from about 200-300 ms of the speech signal.
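As a rough illustration of how discriminant vectors over trajectory segments can be obtained, the sketch below solves the usual LDA eigenproblem with within-class and between-class scatter matrices. The data are synthetic, the segment length of 21 frames, the ridge term, and all names are assumptions made for this example, and nothing here reproduces the actual Switchboard setup.

```python
import numpy as np

def lda_discriminants(X, y, n_vectors=3, ridge=1e-6):
    """Return the leading LDA discriminant vectors (as rows) and their
    eigenvalues. X holds one trajectory segment per row, y the class labels."""
    dim = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((dim, dim))            # within-class scatter
    Sb = np.zeros((dim, dim))            # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        Sw += centered.T @ centered
        diff = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Small ridge (an assumption of this sketch) keeps Sw invertible.
    Sw += ridge * np.trace(Sw) / dim * np.eye(dim)
    # Whiten Sw, then diagonalize Sb in the whitened space.
    evals_w, evecs_w = np.linalg.eigh(Sw)
    whiten = evecs_w / np.sqrt(evals_w)  # scales column j by 1/sqrt(evals_w[j])
    evals_b, evecs_b = np.linalg.eigh(whiten.T @ Sb @ whiten)
    order = np.argsort(evals_b)[::-1][:n_vectors]
    return (whiten @ evecs_b[:, order]).T, evals_b[order]

# Synthetic stand-in: 4 classes whose mean trajectories differ by a bump.
rng = np.random.default_rng(0)
dim = 21                                 # hypothetical: ~210 ms at 10 ms steps
bump = np.hanning(dim)
labels = rng.integers(0, 4, size=2000)
offsets = np.array([-1.5, -0.5, 0.5, 1.5])
segments = rng.standard_normal((2000, dim)) + offsets[labels][:, None] * bump

vectors, strengths = lda_discriminants(segments, labels)
print(vectors.shape)
```

Each returned row can be read as a set of FIR filter weights applied along the time trajectory, which is the interpretation used in the text. In this toy example the class means lie along a single direction, so only the first discriminant vector is meaningful.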


Figure 2: Frequency response of the first three discriminant vectors of LDA.


Figure 3: Impulse response of the first three discriminant vectors of LDA.

The frequency response (Fig. 2) and the impulse response (Fig. 3) of the first discriminant vector are consistent with the current RASTA filter. The second discriminant vector exhibits a frequency response with two resonant peaks, the higher peak at about 10 Hz. The third discriminant vector effectively differentiates the trajectory (i.e. computes delta features).
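To make the FIR-filter reading concrete, the sketch below evaluates the magnitude response of a set of filter taps at a given modulation frequency. It assumes a 100 Hz frame rate and uses a common five-frame regression-style delta filter as a stand-in for a differentiating discriminant vector; neither the taps nor the frame rate come from the experiments described here.

```python
import cmath
import math

def magnitude_response(taps, freq_hz, frame_rate_hz=100.0):
    """Magnitude of the FIR frequency response at `freq_hz`.

    `taps` are the filter weights; `frame_rate_hz` is the rate at which
    the feature trajectory is sampled (assumed 100 Hz, i.e. 10 ms frames).
    """
    omega = 2.0 * math.pi * freq_hz / frame_rate_hz
    response = sum(h * cmath.exp(-1j * omega * n) for n, h in enumerate(taps))
    return abs(response)

# Hypothetical stand-in for a differentiating vector: five-frame delta weights.
delta_taps = [0.2, 0.1, 0.0, -0.1, -0.2]

for freq in (0.0, 1.0, 5.0, 10.0):
    print(freq, magnitude_response(delta_taps, freq))
```

A differentiator of this kind removes the constant (0 Hz) component of the trajectory and passes faster modulations more strongly, which is the band-pass-like behavior one would look for when comparing a discriminant vector with delta features.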



Last modified on October 16, 1996
Christophe Ris <ris@cspjhu.ece.jhu.edu>