This model uses vocal tract resonance (VTR) as the partially "hidden" (continuous)
state to piece together all segments ("phones") in an utterance.
It uses the explicit continuity constraint to incorporate the correlation
among the segments ("phones") into the statistical model for the speech dynamics.
Technically, this is a "super-segmental" model where each statistically
defined "segment" can be as large as the entire utterance (due to the
constraint imposed across the entire utterance).
Details of the model (including model construction and training/recognition
algorithms) can be found in Section 4.3.2 of the article
"A dynamic, feature-based approach to the interface
between phonology and phonetics for speech modeling and recognition,"
Speech Communication, 1998.
Another short description of the model can be found in my proposal
presented at the Airlie meeting in October 1997.
Shown in the example is a spectrogram with formant trajectories superimposed (from xwaves), followed by the results of fitting the formant data with a model for the VTR dynamics. The example shows how the modeled trajectory smooths out the noise in the formant data computed by xwaves.
We are using these values as the F1/F2/F3 targets (in Hz)
to initialize the stochastic dynamic model parameters.
They are based on the Klatt synthesizer setup.
These initial time-constant parameters are based on the consideration that the articulators responsible for producing different phones have different intrinsic movement rates. (This difference translates into a difference in the VTR movement rates among the various phones.) For example, VTR transitions for labial consonants (/b/, /m/, /p/) are significantly faster than those for alveolar consonants with tongue-blade features (/d/, /t/, /n/,...), and faster still than those for velar consonants and most vowels with tongue-dorsum features. These initial values, however, are subject to iterative EM-like training before the recognition (rescoring) stage. Ultimately, we will use a Bayesian strategy to model the time-constant parameters by positive-valued statistical distributions. But in this summer workshop, we will use deterministic time-constant parameters.
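The role of the time-constant parameter can be illustrated with a minimal sketch. It assumes a noise-free, scalar, first-order target-directed recursion, z[k+1] = phi * z[k] + (1 - phi) * target, which is one common way to write such a system and is an assumption here, not necessarily the exact form used in the model. The function name and all numeric values are hypothetical.

```python
def simulate_vtr(z0, target, phi, n_frames):
    """Noise-free first-order recursion toward a target (assumed form).

    z[k+1] = phi * z[k] + (1 - phi) * target, with 0 < phi < 1.
    A smaller phi means a faster approach to the target.
    """
    trajectory = [z0]
    for _ in range(n_frames):
        trajectory.append(phi * trajectory[-1] + (1.0 - phi) * target)
    return trajectory


# Hypothetical F2 values in Hz; these are not the initial values used in the model.
fast = simulate_vtr(z0=500.0, target=1500.0, phi=0.6, n_frames=10)
slow = simulate_vtr(z0=500.0, target=1500.0, phi=0.9, n_frames=10)

# Remaining distance to the target after 10 frames for each setting.
fast_gap = abs(fast[-1] - 1500.0)
slow_gap = abs(slow[-1] - 1500.0)
```

Under this assumed form, a phone class with fast articulator movement (e.g. labials) would be given a smaller phi than a slower class, so its trajectory closes the gap to the target in fewer frames.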
The tying of the MLP training (nonlinear mapping from VTR to MFCC) is made according to the following grouping (based on the consideration of discrimination of phones using differential VTR target information and using differential nonlinear mapping from VTR to MFCC):
An MLP (program written by Mike Schuster) is used for the part of the DSM model that maps internal states to feature vectors, in our case MFCCs. Currently we use as internal states the formants F1/F2/F3 generated with xwaves. The following output was used to check how well the MLP predicts MFCCs from the formant data. Output and reference (true) MFCCs for several frames from phone 'ay' (vowel) predicted from the first three formants:
Output and reference MFCCs for several frames from phone 'l' (consonant) predicted from the first
three formants:
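The kind of check described above, comparing MLP output with reference MFCCs, can be sketched as follows. This is not Mike Schuster's program. It is a minimal pure-Python forward pass that assumes a single tanh hidden layer without bias terms, random untrained weights, and inputs scaled from Hz to kHz. The layer sizes (3 inputs, 100 hidden units, 12 outputs) follow the parameter count given elsewhere on this page; every function name is hypothetical.

```python
import math
import random

N_IN, N_HIDDEN, N_OUT = 3, 100, 12  # F1/F2/F3 in, hidden units, MFCCs out


def init_mlp(seed=0):
    """Random, untrained weights (illustration only)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(N_IN)] for _ in range(N_HIDDEN)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(N_HIDDEN)] for _ in range(N_OUT)]
    return w1, w2


def predict_mfcc(formants_hz, w1, w2):
    """Map one frame of formants to a 12-dimensional MFCC-like vector."""
    x = [f / 1000.0 for f in formants_hz]  # Hz to kHz scaling is an assumption
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def mean_squared_error(predicted, reference):
    """Average squared difference between predicted and reference MFCCs."""
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(reference)


w1, w2 = init_mlp()
output = predict_mfcc([500.0, 1500.0, 2500.0], w1, w2)  # hypothetical formants
```

With trained weights and real reference frames, `mean_squared_error(output, reference)` would give a single number per frame for judging the fit.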
In this particular experiment, 18 utterances were selected. Alternate transcriptions were hand-generated, force-aligned, and fed to a conventional HMM recognizer and also to successive revisions of the DSM model. The goal was to see how often the system would choose the reference transcription over the alternate transcriptions, and to see if the DSM model met this goal better than the HMM system.
For testing, 18 utterances were chosen from conversation 3107A: 0001, 0004, 0011, 0013, 0014, 0021, 0028, 0029, 0030, 0033, 0034, 0035, 0037, 0038, 0041, 0043, 0044, and 0048.
In the following table, the numbers are the rank of the reference transcription among all six hypotheses. Ranks range from 1 to 6, and 1 is best. The "DSM #" columns (e.g. DSM 1, DSM 3.5) refer to different revisions within version 1 of the model, each ranked using a total log likelihood measure (i.e. Score of MFCCs plus Score of Zk's). Clicking on the utterance # will give you the text of the utterance, the exact HMM scores, and the actual audio recording of the utterance.
| Utterance # | HMM Rank | DSM 1 Rank | DSM 2 Rank | DSM 3 Rank | DSM 3.5 Rank | DSM 4 Rank |
Note: since these ranks are ranks of log probabilities, and they are all on different scales, the average isn't statistically valid.
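The ranking used in the table can be sketched as below. The scores are invented for illustration and are not results from these experiments. The sketch assumes that the reference transcription is listed first among the six hypotheses and that a higher (less negative) log likelihood is better.

```python
def total_score(mfcc_score, zk_score):
    """Total log likelihood: Score of MFCCs plus Score of Zk's."""
    return mfcc_score + zk_score


def rank_of_reference(scores, reference_index=0):
    """1-based rank of the reference; 1 means it beat every alternative."""
    ref = scores[reference_index]
    better = sum(1 for i, s in enumerate(scores)
                 if i != reference_index and s > ref)
    return 1 + better


# Hypothetical log likelihoods for one utterance (reference first).
mfcc_scores = [-1200.0, -1190.0, -1250.0, -1300.0, -1220.0, -1275.0]
zk_scores = [-300.0, -340.0, -310.0, -290.0, -330.0, -305.0]

totals = [total_score(o, z) for o, z in zip(mfcc_scores, zk_scores)]
rank = rank_of_reference(totals)
```

In this made-up case the reference ranks first on the total score even though one alternative has a better MFCC score on its own.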
Average Word Error Rate (WER) was computed by Sandi, using the NIST program. We calculated the WER of the top choice of the recognizer for each utterance, then averaged across all utterances.
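As a rough illustration of this averaging (not the NIST program itself), the sketch below counts word errors with a plain edit distance over whitespace-separated words and then averages the per-utterance rates. The function names and example strings are hypothetical, and references are assumed to be non-empty.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1], len(ref)


def average_wer(pairs):
    """Mean of per-utterance WERs (each utterance weighted equally)."""
    rates = []
    for reference, hypothesis in pairs:
        errors, n_words = word_errors(reference, hypothesis)
        rates.append(errors / n_words)  # assumes a non-empty reference
    return sum(rates) / len(rates)


# Hypothetical utterances: (reference, top-choice hypothesis).
pairs = [("a b c d", "a x c"), ("e f", "e f")]
wer = average_wer(pairs)
```

Averaging per-utterance rates weights short and long utterances equally, which in general differs from pooling all errors and dividing by the total word count.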
This table is similar to the one above, except a different statistical measure was used in the model to rank the hypotheses: the score of the MFCCs, ignoring the Zk's. It suggests that the total likelihood measure (MFCC score + Zk score, the one which gave the results in the above table) is more useful than just the score of the MFCCs.
Benchmark results were obtained on the same recognition tasks using a good HMM system. (The HMM system is one of the best in ws97, with word-internal triphones clustered by decision tree, with a bigram language model, and with a word error rate of 49.1%.) Note: the total number of HMM model parameters is: 39 (feature-vector dim) x 12 (no. of mixtures) x 2 (mean plus variance) x 3500 (no. of HMM states clustered by DT) = 3,276,000. In contrast, the total number of parameters in the VTR-dynamic system recognizer is: 42 (no. of "symbols" including 8 CD phones) x 3 (F1/F2/F3 targets T) + 42 x 3 (F1/F2/F3 Phi's) + 10 (classes of MLP) x 100 (no. of hidden units) x 12 (no. of output MFCC units) + 10 x 100 x 3 (no. of input Zk units) = 15,252.
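The two parameter counts can be reproduced with a few lines of arithmetic. The variable names are only labels for the factors quoted above.

```python
# HMM: feature dim x mixtures x (mean + variance) x clustered states
hmm_params = 39 * 12 * 2 * 3500

# VTR-dynamic system
n_symbols = 42          # "symbols", including 8 CD phones
n_vtr = 3               # F1/F2/F3
n_mlp_classes = 10
n_hidden = 100
n_mfcc = 12

target_params = n_symbols * n_vtr                        # targets T
phi_params = n_symbols * n_vtr                           # Phi's
hidden_to_output = n_mlp_classes * n_hidden * n_mfcc     # MLP output weights
input_to_hidden = n_mlp_classes * n_hidden * n_vtr       # MLP input weights

vtr_params = target_params + phi_params + hidden_to_output + input_to_hidden
ratio = hmm_params / vtr_params
```

By these counts the HMM system has roughly 215 times as many parameters as the VTR-dynamic system.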
Details and graphs of the HMM scoring for the first batch of 18 utterances are available here.
The rescoring results for DSM 1 are contained in here. We used Phi and T parameters that were manually set (in a first-order dynamic system). The scores are arranged according to the following order:
Correct hypothesis, Incorrect hypothesis 1, Incorrect hypothesis 2, Incorrect hypothesis 3, Incorrect hypothesis 4, Incorrect hypothesis 5.
The results for DSM 2 are here. They are essentially the same as the above, except Phi/T parameters for one phone (/uw/) are manually modified (to test the system sensitivity), and also we show both the Score of MFCCs (LikeO) and Score of Zk's (LikeZ) in addition to the total score. (Score of Zk's alone is not tabulated as the results were judged to be too poor.)
The results for DSM 3 are here. These were obtained by re-training the MLP weights (using only 42 files of conversation 2149A) with manually segmented phone boundaries. It took me about 3 hrs on July 22 to segment these 42 files using xwaves and a PERL script written by Terri (in ~/tmp_Terri), which considerably speeds up this process (thanks a lot!).
Next, the results for DSM 3.5 are here. These were obtained by training one iteration of the EM algorithm, using HMM-segmented phone boundaries for the MLP and using my "magic" tables to initialize the Phi and T parameters.
Finally, in DSM 4, we ran the rescoring experiments using my hand-segmented boundaries on the test data, based on my notion of how the phones should be timed in the type of model we are using here. (It took me about 1.5 hrs on July 23 to segment the 18 files.) The results, shown here, are very interesting: for half (9) of the 18 utterances we got top-one scores, but for a few files we got bottom scores. I need to go back to examine why this is the case!
These simple tests demonstrate that the model is behaving reasonably well. The correct hypotheses generally give better scores than the alternative hypotheses.
It appears that the boundary information suitable to fit the notion of the VTR-dynamic initiation (symbol dependent, not the same as the phone boundaries such as those marked in TIMIT and ICSI-switchboard data) directly affects the recognizer's performance.
| Utterance # | HMM Rank | DSM 4 Rank (auto align) | DSM 4 Rank (manual align) | DSM 4 Rank (async shift) |
Note: since these ranks are ranks of log probabilities, and they are all on different scales, the average may not be statistically valid.
DSM 4 auto align: The results, shown here, are reasonable (considering the cross-speaker condition). (The errors for the 10 individual utterances are counted as 8/11, 0/10, 2/18, 2/8, 1/17, 2/4, 2/7, 3/3, 0/10, 3/13, 7/28, respectively; these total to 28/111, which is 25.7%.)
DSM 4 manual align: The rescoring results (the second part of the experiment) are very instructive, shown here. Will analyse the error patterns in terms of Zk fitting and Ok fitting soon.
DSM 4 async shift: Although there is a training/testing discrepancy in the third part of this experiment (and no constraints have been implemented in the EKF), it is not too discouraging to see that this 2-D "feature" overlapping implementation gives reasonable results, as shown here.
Note: since these ranks are ranks of log probabilities, and they are all on different scales, the average may not be statistically valid.
HMM: It makes sense that the HMM rejected most of the reference transcriptions, because it's rescoring its own best hypotheses. The only reason the reference won once was because of the language model score, which was used for n-best generation but not for rescoring.
DSM N1 (auto align): The details of the results of the new (after all Rescoring Experiments 2) N-best (N=5 plus reference) experiments are in here. No 2-D feature-overlaps; simple training; small size of training data; no constraint on Zk is built into the EKF on rescoring.
Details of more experiments (using 2-D feature-overlaps) on the same N-lists of the 10 utterances are provided in the following files.
DSM N2 (auto align): See here for the results of the experiment with two targets and two system matrices (as well as 9 MLPs) trained automatically by 3 iterations of the EM algorithm. All data from speaker 1028, except one conversation, are used in training.
DSM N2 (manual align): See here for the results of the same experiment as above except the reference transcriptions of the 10 utterances have been slightly modified by hand.
DSM N3 (auto align): See here for the results of the experiment in which a silence model (no targets!) is used. The silence model is a single-mode 12-dimensional Gaussian distribution on the MFCCs, with the 12-D mean obtained by averaging silence frames in the training data. The (diagonal) covariance matrix is fixed by hand (important).
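A minimal sketch of such a silence model follows: the mean is the average of the silence frames and the diagonal variances are supplied by hand. The frame values and variances below are hypothetical placeholders, not the settings used in the experiment.

```python
import math


def mean_vector(frames):
    """Average a list of equal-length MFCC frames, dimension by dimension."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def diag_gaussian_loglik(frame, mean, variances):
    """Log likelihood of one frame under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(frame, mean, variances))


# Hypothetical 12-D silence frames from training data.
silence_frames = [[0.0] * 12, [2.0] * 12]
silence_mean = mean_vector(silence_frames)

# Diagonal variances fixed by hand (placeholder values).
silence_var = [1.0] * 12

loglik_at_mean = diag_gaussian_loglik(silence_mean, silence_mean, silence_var)
loglik_far = diag_gaussian_loglik([5.0] * 12, silence_mean, silence_var)
```

Because the variances are set by hand, they directly control how sharply the score drops for frames away from the silence mean, which is one plausible reading of why this setting is flagged as important.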
DSM N3 (manual align): See here for the results of the same experiment as above except the reference transcriptions of the 10 utterances have been slightly modified by hand.
All of the above experiments are based on Version One of the VTR-dynamic model on a small amount of test data.
Our team is now preparing a total of 1000 male test utterances from the SWBD-97 development set, based on the evaluation scheme established in the July 28 meeting with combined inputs from George Doddington, Joe Picone, John Bridle and myself (Deng).
We are also preparing Versions 2, 3, and 4 of the VTR-dynamic model to be tested on the above "full" set of test data. In summary,