Testing of the Dynamic-VTR Segment Model
Li Deng and Roland Reagan, July/August 1998



Some initial experiments with the segmental VTR-dynamic model --- diagnostics and rescoring on limited test utterances of the Switchboard data   

Overview

This model uses vocal tract resonance (VTR) as the partially "hidden" (continuous) state to piece together all segments ("phones") in an utterance. It uses an explicit continuity constraint to incorporate the correlation among the segments ("phones") into the statistical model for the speech dynamics. Technically, this is a "super-segmental" model where each statistically defined "segment" can be as large as the entire utterance (due to the constraint imposed across the entire utterance). Details of the model (including model construction and training/recognition algorithms) can be found in Section 4.3.2 of the article "A dynamic, feature-based approach to the interface between phonology and phonetics for speech modeling and recognition," Speech Communication, 1998. Another short description of the model can be found in my proposal presented at the Airlie meeting, October 1997.


Examples

Here are some examples to test the parameter-learning algorithm.

Shown in the example is a spectrogram with formant trajectories superimposed (from xwaves), then data (formants) fitting results using a model for the VTR dynamics. The example shows how the modeled trajectory smooths out the noise from the data computed from xwaves.


Initial VTR Targets

We are using these values as the F1/F2/F3 targets (in Hz) to initialize the stochastic dynamic model parameters. They are based on the Klatt synthesizer setup.


Initial Time Constant Parameters

Here are the initial Time Constant Parameters (Phi's) for the first-order dynamic system model (exponential form of the modeled VTR-trajectories).

These initial time-constant parameters are based on the consideration that the articulators responsible for producing different phones have different intrinsic movement rates. (This difference translates to the difference in the VTR movement rates among the varying phones.) For example, VTR transitions for labial consonants (/b/, /m/, /p/) are significantly faster than those for alveolar consonants with tongue-blade features (/d/, /t/, /n/,...), and still faster than those for velar consonants and most vowels with tongue-dorsum features. These initial values, however, are subject to iterative EM-like training before the recognition (rescoring) stage. Ultimately, we will use a Bayesian strategy to model the time-constant parameters by positive-valued statistical distributions. But in this summer workshop, we will use deterministic time-constant parameters.
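As a minimal sketch of what a first-order dynamic system with an exponential trajectory can look like, the Python below iterates z[k+1] = phi * z[k] + (1 - phi) * target. The exact recursion, the function name vtr_trajectory, and all numerical values are assumptions made for illustration; they are not the parameters used in these experiments. Under this assumed form, a smaller Phi moves the trajectory toward its target faster.

```python
def vtr_trajectory(z0, target, phi, n_frames):
    """Noise-free first-order trajectory moving from z0 toward target.

    Assumed recursion (illustrative): z[k+1] = phi * z[k] + (1 - phi) * target,
    with 0 < phi < 1. Its closed form is target + (z0 - target) * phi**k,
    i.e. an exponential approach to the target.
    """
    if not 0.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between 0 and 1")
    z = [float(z0)]
    for _ in range(n_frames - 1):
        z.append(phi * z[-1] + (1.0 - phi) * target)
    return z


# Hypothetical values: an F2 moving from 1200 Hz toward an 1800 Hz target.
fast = vtr_trajectory(1200.0, 1800.0, phi=0.6, n_frames=10)  # faster transition
slow = vtr_trajectory(1200.0, 1800.0, phi=0.9, n_frames=10)  # slower transition
```

After the same number of frames, the trajectory with the smaller Phi ends closer to the target, which is how differing articulator rates would show up in this sketch.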


MLP Tying

The tying of the MLP training (nonlinear mapping from VTR to MFCC) is made according to the following grouping (based on the consideration of discrimination of phones using differential VTR target information and using differential nonlinear mapping from VTR to MFCC):

  1. aw ay ey ow oy aa ae ah ao ax ih iy uh uw er eh el
  2. l w r y
  3. f th sh
  4. s ch
  5. v dh zh
  6. z jh
  7. p t k
  8. b d g
  9. m n ng en
  10. sil sp
A total of 10 distinct MLPs are used. All vowels are tied using one MLP, because vowel class distinction is based on different target values in the VTR domain. In contrast, /s/ and /sh/ have separate MLPs, because their target VTR values are similar to each other (in terms of their attraction of VTR transitions from the adjacent phones) and hence their distinction will be based mainly on the different VTR-to-MFCC mappings. In this case, the greater energy at lower frequencies for /sh/ than for /s/ in the acoustic domain is modeled by different MLP weights (which are trained), rather than by different VTR target values (because the behavior of attracting adjacent phones' VTR transitions is similar between them).
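The tying can be written down as a simple lookup table. The sketch below transcribes the ten groups listed above; the names MLP_GROUPS, PHONE_TO_MLP and mlp_for are hypothetical and are not taken from the workshop code.

```python
# Phone-to-MLP tying, transcribed from the ten groups listed above.
MLP_GROUPS = {
    1: "aw ay ey ow oy aa ae ah ao ax ih iy uh uw er eh el",
    2: "l w r y",
    3: "f th sh",
    4: "s ch",
    5: "v dh zh",
    6: "z jh",
    7: "p t k",
    8: "b d g",
    9: "m n ng en",
    10: "sil sp",
}

# Invert the grouping so each phone maps to the index of its shared MLP.
PHONE_TO_MLP = {
    phone: index
    for index, phones in MLP_GROUPS.items()
    for phone in phones.split()
}


def mlp_for(phone):
    """Return the index (1-10) of the MLP shared by this phone's group."""
    return PHONE_TO_MLP[phone]
```

For instance, mlp_for("iy") and mlp_for("aa") both give 1, while mlp_for("s") and mlp_for("sh") give different indices (4 and 3).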

Other Examples

Here are some more examples that need to be examined for the DSM model.

Testing The MLP

An MLP (program written by Mike Schuster) is used for the part of the DSM model that maps internal states to feature vectors, in our case MFCCs. Currently we use as internal states the formants F1/F2/F3 generated with xwaves. The following output was used to check how well the MLP predicts MFCCs from the formant data. Output and reference (true) MFCCs for several frames from phone 'ay' (vowel), predicted from the first three formants:

Outputs:
Targets:


Output and reference MFCCs for several frames from phone 'l' (consonant) predicted from the first three formants:

Outputs:
Targets:
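The kind of check described in this section, comparing MLP outputs against reference MFCCs frame by frame, can be sketched as follows. This is not Mike Schuster's program: the single tanh hidden layer without bias terms, the kHz input scaling, the random (untrained) weights, and the RMSE error measure are all assumptions for illustration. Only the layer sizes (3 inputs, 100 hidden units, 12 outputs) follow the parameter count given later on this page.

```python
import math
import random


def make_mlp(n_in=3, n_hidden=100, n_out=12, seed=0):
    """Randomly initialised weights for a one-hidden-layer MLP (untrained)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_out)]
    return w1, w2


def mlp_forward(weights, formants_hz):
    """Map one frame of F1/F2/F3 (Hz) to a 12-dimensional MFCC prediction."""
    w1, w2 = weights
    x = [f / 1000.0 for f in formants_hz]  # assumed input scaling (kHz)
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def rmse(predicted, reference):
    """Root-mean-square error between predicted and reference vectors."""
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(reference))


weights = make_mlp()
predicted = mlp_forward(weights, [700.0, 1200.0, 2500.0])  # hypothetical frame
```

With trained weights, rmse(predicted, reference_mfcc) per frame would give one number summarising how closely an "Outputs" row matches its "Targets" row.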

Preliminary Rescoring Experiments

In the preliminary rescoring experiments, only subsets of the training and test sets are used, and some model parameters are fixed by hand (by me, from general knowledge of acoustic phonetics and from spectrogram analysis) rather than by training. These sets of results are obtained using only the speech data from speaker 1028 (a total of 217.1 min of speech data, from ws97_train). The data are broken down into a total of 24 conversations: 6 conversations are used as test data, and the remaining 18 conversations as training data. Because only a single speaker is used, we avoid normalization problems for VTR targets and for MFCC observations.

Rescoring Experiment 1 (same speaker; artificial hypotheses)

Overview

In this particular experiment, 18 utterances were selected. Alternate transcriptions were hand-generated, force-aligned, and fed to a conventional HMM recognizer and also to successive revisions of the DSM model. The goal was to see how often the system would choose the reference transcription over the alternate transcriptions, and to see if the DSM model met this goal better than the HMM system.

Data

The following results are obtained using only one conversation (2149A) in training. Note: No training of Phi and T parameters is done yet. The only training done is the MLP weights using all utterances (files) in 18 conversations of speaker 1028.

For testing, 18 utterances were chosen from conversation 3107A: 0001, 0004, 0011, 0013, 0014, 0021, 0028, 0029, 0030, 0033, 0034, 0035, 0037, 0038, 0041, 0043, 0044, and 0048.

Design

  1. For each of the 18+10 reference transcriptions, 5 alternate hypotheses were generated. These alternate hypotheses were hand-generated with the help of an error log of common confusions made by HMMs on Switchboard experiments. The error log was used to find commonly confused word pairs, which were used as a basis for creating the alternate hypotheses. Both this heuristic and subjective acoustic similarity were taken into account when creating the alternate hypotheses.
  2. All (18+10)*6 transcriptions were aligned with the acoustics by an HMM model (not very good). This is how we get the phone boundaries.
  3. An HMM model was run on the aligned data to produce log probability scores. The model was the WS'97 Syllable Team's model based on wrdi models / wrdi decoding / triphones.
  4. The DSM model was also run on the aligned data to produce log probability scores.
  5. For each utterance and each model, the six scores were sorted and ranked. We measure success based on whether the rescorer ranked the reference transcription higher than all five alternate transcriptions. Therefore, success is rank 1, and chance is rank 3.5. Because a rank of 2 is better than a rank of 6, the actual ranks are shown and averaged in the tables below.
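The ranking in step 5 can be sketched in a few lines of Python. The function name reference_rank, the tie-handling rule, and the scores are hypothetical; the only facts taken from the design above are that six log-probability scores are compared per utterance, that rank 1 is success, and that chance is rank 3.5.

```python
def reference_rank(reference_score, alternate_scores):
    """Rank of the reference among all hypotheses (1 = best).

    Scores are log probabilities, so higher is better. Ties are counted
    against the reference, a conservative choice made for this sketch.
    """
    return 1 + sum(1 for s in alternate_scores if s >= reference_score)


# Hypothetical log-probability scores for one utterance:
# one reference transcription and five alternates.
rank = reference_rank(-1520.4, [-1518.9, -1533.0, -1540.2, -1525.7, -1561.3])

# Expected rank under random ordering of six hypotheses: mean of 1..6.
chance_rank = sum(range(1, 7)) / 6
```

Here exactly one alternate outscores the reference, so rank is 2, and chance_rank evaluates to 3.5.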

Results

In the following table, the numbers are the rank of the reference transcription among all six hypotheses. They range from 1 to 6, with 1 the best. The "DSM #" columns (e.g. DSM 1, DSM 3.5) refer to different revisions within version 1 of the model, each using a log-likelihood measure (i.e. score of MFCCs plus score of Zk's). Clicking on the utterance # will give you the text of the utterance, the exact HMM scores, and the actual audio recording of the utterance.

Utterance #  HMM Rank  DSM 1 Rank  DSM 2 Rank  DSM 3 Rank  DSM 3.5 Rank  DSM 4 Rank
0001         1         4           3           3           2             1
0004         2         1           1           1           1             1
0011         2         4           4           4           3             1
0013         5         4           4           4           5             6
0014         2         3           3           3           2             2
0021         3         4           4           5           2             2
0028         1         4           1           1           3             5
0029         1         1           1           2           1             3
0030         4         3           3           3           3             6
0033         1         1           1           2           2             1
0034         1         1           1           3           2             1
0035         1         1           1           1           1             2
0037         3         3           3           4           5             1
0038         4         4           4           4           4             4
0041         1         6           6           6           6             1
0043         6         2           2           2           3             3
0044         1         1           1           2           1             1
0048         3         3           3           3           2             1
Average      2.33      2.78        2.56        2.94        2.67          2.33
% Correct    44.0      33.3        38.8        16.7        22.2          50.0
Avg WER      39.2%     34.2%       30.4%       48.1%       41.8%         22.8%

Note: since these ranks are ranks of log probabilities, and they are all on different scales, the average isn't statistically valid.
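The "Average" and "% Correct" rows can be recomputed from the per-utterance ranks. The short Python check below does this for the "DSM 4 Rank" column of the table above; the variable names are illustrative.

```python
# Ranks of the reference transcription for the 18 utterances, copied from
# the "DSM 4 Rank" column (utterances 0001 through 0048, in table order).
dsm4_ranks = [1, 1, 1, 6, 2, 2, 5, 3, 6, 1, 1, 2, 1, 4, 1, 3, 1, 1]

# Mean rank of the reference over all utterances.
average_rank = sum(dsm4_ranks) / len(dsm4_ranks)

# Percentage of utterances where the reference was ranked first.
percent_correct = 100.0 * sum(1 for r in dsm4_ranks if r == 1) / len(dsm4_ranks)

print(f"average rank = {average_rank:.2f}")     # 2.33
print(f"% correct    = {percent_correct:.1f}")  # 50.0
```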

Average Word Error Rate (WER) was computed by Sandi, using the NIST program. We calculated the WER of the top choice of the recognizer for each utterance, then averaged across all utterances.

This table is similar to the one above, except a different statistical measure was used in the model to rank the hypotheses: the score of the MFCCs, ignoring the Zk's. It suggests that the total likelihood measure (MFCC score + Zk score, the one which gave the results in the above table) is more useful than just the score of the MFCCs.

Benchmark results were obtained on the same recognition tasks using a good HMM system. (The HMM system is one of the best in ws97, with word-internal triphones clustered by decision tree, with a bigram language model, and with a word error rate of 49.1%.) Note: the total number of HMM model parameters is: 39 (feature-vector dim) x 12 (no. of mixtures) x 2 (mean plus variance) x 3500 (no. of HMM states clustered by DT) = 3,276,000. In contrast, the total number of parameters in the VTR-dynamic system recognizer is: 42 (no. of "symbols" including 8 CD phones) x 3 (F1/F2/F3 targets T) + 42 x 3 (F1/F2/F3 Phi's) + 10 (classes of MLP) x 100 (no. of hidden units) x 12 (no. of output MFCC units) + 10 x 100 x 3 (no. of input Zk units) = 15,252.
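The two parameter counts can be checked with a few lines of arithmetic. The Python below only restates the factors given in the paragraph above; the variable names are illustrative.

```python
# HMM: feature dim x mixtures x (mean + variance) x clustered states.
hmm_params = 39 * 12 * 2 * 3500

# VTR-dynamic recognizer:
targets = 42 * 3            # F1/F2/F3 targets T for 42 symbols
time_constants = 42 * 3     # F1/F2/F3 Phi's for 42 symbols
mlp_output = 10 * 100 * 12  # hidden-to-output weights, 10 MLP classes
mlp_input = 10 * 100 * 3    # input-to-hidden weights, 10 MLP classes
dsm_params = targets + time_constants + mlp_output + mlp_input

ratio = hmm_params / dsm_params

print(hmm_params)  # 3276000
print(dsm_params)  # 15252
```

By this count the HMM system has a little over 200 times as many parameters as the VTR-dynamic recognizer.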

Details and graphs of the HMM scoring for the first batch of 18 utterances are available here.

The rescoring results for DSM 1 are contained here. We used Phi and T parameters that were manually set (in a first-order dynamic system). The scores are arranged in the following order:

Correct hypothesis
Incorrect hypothesis 1
Incorrect hypothesis 2
Incorrect hypothesis 3
Incorrect hypothesis 4
Incorrect hypothesis 5

The results for DSM 2 are here. They are essentially the same as the above, except Phi/T parameters for one phone (/uw/) are manually modified (to test the system sensitivity), and also we show both the Score of MFCCs (LikeO) and Score of Zk's (LikeZ) in addition to the total score. (Score of Zk's alone is not tabulated as the results were judged to be too poor.)

The results for DSM 3 are here. These were obtained by re-training the MLP weights (using only 42 files of conversation 2149A) with manually segmented phone boundaries. It took me about 3 hrs on July 22 to segment these 42 files using xwaves and a PERL script written by Terri (in ~/tmp_Terri), which considerably speeds up this process (thanks a lot!).

Next, the results for DSM 3.5 are here. These are obtained by training one iteration of the EM algorithm using HMM-segmented phone boundaries for the MLP and using my "magic" tables to initialize the Phi and T parameters.

Finally, in DSM 4, we ran the rescoring experiments using my hand-segmented boundaries on the test data, based upon my notion of how the phones should be timed in the type of model we are using here. (It took me about 1.5 hrs on July 23 to segment the 18 files.) The results, shown here, are very interesting: for half (9) of the 18 utterances we got top-one scores, but for a few files we got bottom scores. I need to go back to examine why this is the case!

Summary

These simple tests demonstrate that the model is behaving reasonably well. The correct hypotheses generally give better scores than the alternative hypotheses.

It appears that the boundary information suitable to fit the notion of the VTR-dynamic initiation (symbol dependent, not the same as the phone boundaries such as those marked in TIMIT and ICSI-switchboard data) directly affects the recognizer's performance.


Rescoring Experiment 2 (cross speakers; artificial hypotheses)

Overview

This experiment is similar to the previous one, which used the same speaker (1028) for training and testing. Ten utterances were selected from the test set. Alternate transcriptions were hand-generated, force-aligned, and fed to a conventional HMM recognizer and also to the DSM model. The goal was to see how often the system would choose the reference transcription over the alternate transcriptions, and to see if the DSM model met this goal better than the HMM system. The key difference is that these utterances were chosen from the test set instead of the training set, and the model was not trained on this speaker.

Data

For the second phase, 10 utterances were chosen from the test set, specifically conversation 2724B. The utterances were: 0007, 0008, 0009, 0017, 0018, 0019, 0031, 0044, 0063, and 0064.

Design

  1. This is a really challenging test of our model/recognizer! Joe selected a new speaker, 1087, for which he has lattices derived from our benchmark HMM system. The VTR-dynamic model/recognizer's parameters are trained using speaker 1028's data only. (MLP weights are trained using utterances in 18 conversations of speaker 1028, with phone boundaries automatically computed from an HMM system. Phi and T are trained with one iteration (nonlinear regression). No speaker normalization is attempted.)
  2. I then spent 0.5 hr (July 24) manually segmenting the 10 test utterances and ran the identical rescorer.
  3. Then, I ran the same experiment as above (using HMM-generated phone boundaries), except that in the rescoring program we asynchronously shift the control regions of the Phi/T and MLP parameters. (This is a crude way of implementing a 2-D "feature" overlap phonological model. We are in the process of implementing the same in the training.)

Results

The following table displays the results of testing on 10 sentences from the test database. Our model was not trained on this speaker. The same procedure as above was used for the generation of the alternative hypotheses. Each column of the table corresponds to one variation of this experiment (the variations are explained above, under the Design heading.)

Utterance #  HMM Rank  DSM 4 Rank (auto align)  DSM 4 Rank (manual align)  DSM 4 Rank (async shift)
0007         3         2                        1                          2
0008         1         1                        1                          1
0009         3         2                        2                          2
0017         1         2                        2                          4
0018         3         5                        4                          2
0019         1         4                        3                          3
0031         4         4                        4                          4
0044         3         1                        1                          6
0063         3         4                        1                          4
0064         1         2                        1                          3
Average      2.30      2.70                     2.00                       3.10
% Correct    40.0      20.0                     50.0                       10.0
Avg WER      17.4%     25.7%                    9.2%                       28.4%

Note: since these ranks are ranks of log probabilities, and they are all on different scales, the average may not be statistically valid.

DSM 4 auto align: The results, shown here, are reasonable considering the cross-speaker condition. (The errors for the 10 individual utterances are counted as 8/11, 0/10, 2/18, 2/8, 1/17, 2/4, 2/7, 3/3, 0/10, 3/13, 7/28, respectively; these total to 28/111 which is 25.7%.)

DSM 4 manual align: The rescoring results (the second part of the experiment) are very instructive, shown here. Will analyse the error patterns in terms of Zk fitting and Ok fitting soon.

DSM 4 async shift: Although there is a training/testing discrepancy in the third part of this experiment (and no constraints have been implemented in the EKF), it is not too discouraging to see that this 2-D "feature" overlapping implementation gives reasonable results, as shown here.


Speech Synthesis Experiments


Further Model Synthesis --- Comparing Use of Reference Transcription/Alignment and Use of Wrong Transcriptions/Alignments


Rescoring Experiment 3 (cross speakers; N-best hypotheses)

Overview

These are the results using the N-best list from our benchmark HMM Recognizer in comparison with several versions of the dynamic-VTR model. Only 10 utterances are used from conversation 2724B.

Results

Utterance #  HMM Rank  DSM N1 Rank (auto align)  DSM N2 Rank (auto align)  DSM N2 Rank (manual align)  DSM N3 Rank (auto align)  DSM N3 Rank (manual align)
0007         6         1                         6                         1                           6                         4
0008         1         2                         1                         1                           1                         1
0009         6         1                         1                         4                           6                         6
0017         6         1                         1                         1                           1                         2
0018         6         6                         1                         1                           5                         5
0019         6         2                         4                         1                           5                         1
0031         6         5                         6                         3                           2                         1
0044         6         6                         4                         4                           1                         1
0063         6         5                         6                         6                           2                         4
0064         6         1                         1                         1                           1                         1
Average      5.5       3                         3.1                       2.3                         3                         2.6
%Correct     10        40                        50                        60                          40                        50
Avg WER      72.5%     34.9%                     30.3%                     29.4%                       32.1%                     35.8%

Note: since these ranks are ranks of log probabilities, and they are all on different scales, the average may not be statistically valid.

HMM: It makes sense that the HMM rejected most of the reference transcriptions, because it's rescoring its own best hypotheses. The only reason the reference won once was because of the language model score, which was used for n-best generation but not for rescoring.

DSM N1 (auto align): The details of the results of the new (after all of Rescoring Experiment 2) N-best (N=5 plus reference) experiments are here. No 2-D feature-overlaps; simple training; small size of training data; no constraint on Zk is built into the EKF on rescoring.

Details of more experiments (using 2-D feature-overlaps) on the same N-lists of the 10 utterances are provided in the following files.

DSM N2 (auto align): See here for the results of the experiment with two targets and two system matrices (as well as 9 MLPs) trained automatically by 3 iterations of the EM algorithm. All data from speaker 1028, except one conversation, are used in training.

DSM N2 (manual align): See here for the results of the same experiment as above except the reference transcriptions of the 10 utterances have been slightly modified by hand.

DSM N3 (auto align): See here for the results of the experiment in which a silence model (no targets!) is used. It is a single-mode, 12-dimensional Gaussian distribution on the MFCCs, with the 12-D mean obtained by averaging silence frames in the training data. The (diagonal) covariance matrix is fixed by hand (important).

DSM N3 (manual align): See here for the results of the same experiment as above except the reference transcriptions of the 10 utterances have been slightly modified by hand.
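The silence model used in the DSM N3 experiments, a single Gaussian with diagonal covariance on 12-D MFCCs, can be sketched as below. The function names, the synthetic frames, and the variance value of 1.0 are assumptions for illustration; the actual hand-fixed covariance values are not given on this page.

```python
import math


def mean_vector(frames):
    """Per-dimension mean of a list of equal-length feature vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def diag_gaussian_loglik(x, mean, variances):
    """Log density of x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, variances)
    )


# Synthetic 12-D "silence" frames; real ones would come from training data.
silence_frames = [[0.1 * d + 0.01 * k for d in range(12)] for k in range(5)]
silence_mean = mean_vector(silence_frames)
silence_var = [1.0] * 12  # hand-fixed variances (illustrative value only)

# Log-likelihood score of one frame under the silence model.
score = diag_gaussian_loglik(silence_frames[0], silence_mean, silence_var)
```

Because the variances are fixed rather than estimated, they act as a tuning knob for how sharply non-silence frames are penalised, which may be why the page flags the hand-set covariance as important.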

All of the above experiments are based on Version One of the VTR-dynamic model on a small amount of test data.


Future Rescoring Experiments

Our team is now preparing a total of 1000 male test utterances from the SWBD-97 development set, based on the evaluation scheme established in the July 28 meeting with combined inputs from George Doddington, Joe Picone, John Bridle and myself (Deng).

We are also preparing Version 2, 3, and 4 of the VTR-dynamic model to be tested on the above "full" set of test data. In summary,


Li Deng
Last modified: 8/6/98 14:20 by eca