CLSP
WORKSHOP '96

In Search of a Better Starting Point


This page describes the experiments conducted by Mark Ordowski.


At the start of the workshop I was under the impression that the ICSI baseline system was at around 70% WER and the HTK baseline system at 50% WER. This caused us great concern. By the third week of the workshop, a report was provided that gave the correct results for the HTK baseline system. Sangita also trained and ran the baseline HTK system on the ICSI-defined dev-test data (240 male utterances) and found the WER to be 60%. All comparisons in this paper should be made with this WER of 60%, not the accuracy rates presented in the HTK baseline system paper.

At the start of the workshop, while the senior members were at Keele, England, there was concern that the lexicon used to train the ICSI baseline system was too far off the task. The main concerns were the number of pronunciations and the durations of the monophone models. Consequently, I started to look into the differences between the lexicon used for the ICSI baseline system that produced the 70% WER and the lexicon used for the HTK baseline system.

Starting from this premise, I explored two questions. First, did the Switchboard lexicon have fewer or more pronunciations? Second, what effect do monophone durations have on the lexicon?

Looking at WS96 Lexical Pronunciations

In short, the lexicon that was used to train the baseline system (prior to the workshop) had more words with multiple pronunciations than the WS96AMDict dictionary. The WS96AMDict, provided for the workshop, is based on Switchboard pronunciations using the PRONLEX phone set. Looking at the cut0 training session completed by NIKKI before the workshop, I noticed that 16% of the words had multiple pronunciations at the start of training, and only 6% after the last iteration of training.
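A percentage like the ones quoted above can be obtained by counting, for each word, how many distinct pronunciations the lexicon lists. The following is a minimal Python sketch, assuming a simple "WORD PHONE PHONE ..." line format; that format, the function name and the toy entries are assumptions for illustration, not the actual ICSI or WS96AMDict file layout.

```python
from collections import defaultdict

def multi_pron_fraction(lines):
    """Return the fraction of words that have more than one distinct pronunciation.

    Assumes each non-empty line is 'WORD PHONE PHONE ...' (hypothetical format).
    """
    prons = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        word, phones = fields[0], tuple(fields[1:])
        prons[word].add(phones)
    if not prons:
        return 0.0
    multi = sum(1 for p in prons.values() if len(p) > 1)
    return multi / len(prons)

# Toy lexicon, invented for illustration only.
toy_lexicon = [
    "the dh ah",
    "the dh iy",
    "cat k ae t",
    "dog d ao g",
    "and ae n d",
]
fraction = multi_pron_fraction(toy_lexicon)
print(f"{fraction:.0%} of words have multiple pronunciations")
```

Running the same count on the lexicon before and after embedded training would give the kind of before/after comparison described above.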

The lexicon for the ICSI baseline system was built using pronunciations from many different tasks (e.g. WSJ). A question ... The number of words used in the Switchboard corpus is much smaller than in the WSJ task. Training a system with too many optional words may give the MLP a view of the data that is too broad. Therefore, if one were to reduce the number of allowed pronunciations, the initial MLP could have a better estimate of the phone emission probabilities. Multiple training iterations could then be run to slowly expand the task.

This section is incomplete. I do have data showing the difference between the ICSI lexicon and the WS96AMdict lexicon, but I really need to redo the analysis because my notes from 6 weeks ago are not very informative to me.

Looking at Monophone Durations

How phones can be compared between the PRONLEX and ICSI56 phone sets

  • The following phones have a 1-1 mapping from the PRONLEX (HTK) set to the ICSI 56 phone set.
  • The following have transformation rules from PRONLEX to ICSI56.
  • The following does not exist in PRONLEX and is not decoded by ICSI 56.
  • The following phone has this interpretation.
  • This reference compares the monophone durations of three systems: an ICSI system trained on the TIMIT database, the WS96 HTK baseline system, and an ICSI embedded-trained Switchboard system built from the ICSI-defined cut0 training files.
    Table of Phone Comparisons

    Where did TIMIT durations come from?

    The TIMIT durations come from a training run done at ICSI on the TIMIT database. These durations are also used to create the BOOT lexicon that was used to train the ICSI system on Switchboard data.

    Where did HTK durations come from?

    The HTK durations were computed by looking at the alignments from the baseline system provided for the workshop. The data covered is all the training data.
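Computing average phone durations from alignments can be sketched roughly as below. This is an illustration only: it assumes each aligned segment is available as a (phone, start_frame, end_frame) tuple with an exclusive end frame and a 10 msec frame step. The tuple layout, the function name and the toy alignment are hypothetical, since the actual alignment file format is not shown on this page.

```python
from collections import defaultdict

FRAME_MS = 10  # frame step in milliseconds (assumed: 10 msec frame rate)

def mean_durations_ms(segments):
    """Average duration per phone, in milliseconds.

    `segments` is an iterable of (phone, start_frame, end_frame) tuples,
    with end_frame exclusive. This tuple layout is a hypothetical stand-in
    for whatever the real alignment files contain.
    """
    total = defaultdict(int)
    count = defaultdict(int)
    for phone, start, end in segments:
        total[phone] += end - start
        count[phone] += 1
    return {p: FRAME_MS * total[p] / count[p] for p in total}

# Toy alignment, invented for illustration only.
toy_alignment = [
    ("k", 0, 6), ("ae", 6, 18), ("t", 18, 23),
    ("k", 40, 48), ("ae", 48, 58),
]
durations = mean_durations_ms(toy_alignment)
for phone, ms in sorted(durations.items()):
    print(phone, ms)
```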

    Where did the ICSI Switchboard system durations come from?

    The ICSI durations came from doing an embedded training on the ICSI-defined Switchboard cut0 training data. The durations used are from the last iteration of training. The data that these durations represent is 1/4 of the data used for the HTK baseline system.

    Observations

    Up to the time I wrote this report, I had never taken a close look at the TIMIT durations. You will notice that TIMIT has the shortest durations, followed by the HTK baseline system and then the ICSI trained system. Until now I was under the impression that the TIMIT durations were longer than those of the HTK or ICSI trained systems. Looking at this data, I'm led to believe that the TIMIT data is spoken faster than the conversational Switchboard data. I would have expected the TIMIT database to have a slower rate of speech. Another explanation is that in the Switchboard database phones are being dropped, and a dropped phone is absorbed by its neighbor, increasing the neighbor's duration.

    I did spend time reviewing the data and checking that the information is in the correct format, and I do believe that I'm comparing apples to apples.

    The experiment to find a better start point

    The objective of this experiment was to minimize the possible negative effect of using incorrect monophone durations and to reduce the number of pronunciations in training in order to make the lexicon more task dependent. In short, a new lexicon was created that used the monophone durations from the last iteration of training of a previously trained ICSI system on Switchboard data and made use of the Switchboard dictionary that was created for the workshop.

    Final Results ... In Short

    Skipping to the end: I was able to provide a new baseline system with a WER of 61.5%. The original baseline system had a WER of 63.6%, so this is an absolute improvement of 2.1%. However, the baseline HTK system trained and tested on the same data had a WER of 60%.
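To make the size of the improvement concrete, the short Python sketch below computes both the absolute and the relative WER reduction from the two figures quoted above. The function name is an illustrative choice, not part of any workshop tool.

```python
def wer_change(old_wer, new_wer):
    """Return (absolute, relative) WER reduction, both in percent."""
    absolute = old_wer - new_wer
    relative = 100.0 * absolute / old_wer
    return absolute, relative

# WER figures quoted in the text: original baseline 63.6%, new baseline 61.5%.
abs_gain, rel_gain = wer_change(63.6, 61.5)
print(f"absolute: {abs_gain:.1f} points, relative: {rel_gain:.1f}%")
```

The 2.1 point absolute gain corresponds to a relative reduction of roughly 3.3% of the original error rate.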

    The System

    The system (MLP portion) was trained on a 4 hour male set, defined prior to the workshop by ICSI: 9019 male utterances, or 1.74 million frames. Testing was done on 240 male utterances, or 74 thousand frames. The system is an HMM/MLP configuration. The transcription (final output) was derived by rescoring the HTK word lattices provided for the workshop. The MLP provides phone emission (posterior) probabilities. The training was an embedded training with 4 iterations. The bootnet was based on a TIMIT 1000 hidden unit net. The features were rastaplp 12 (12 rasta plp coefficients plus 12 delta, 12 delta-delta, delta log energy and delta-delta log energy). The MLP input had a context window of 9 frames.
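As a quick check on the sizes involved, the sketch below adds up the per-frame feature components listed above and multiplies by the context window. It assumes the MLP input is the plain concatenation of the 9 frames, which is the usual arrangement but is not stated explicitly here.

```python
# Per-frame feature layout as listed in the text.
features = {
    "rasta-plp": 12,
    "delta": 12,
    "delta-delta": 12,
    "delta log energy": 1,
    "delta-delta log energy": 1,
}
CONTEXT_FRAMES = 9  # MLP input context window

frame_dim = sum(features.values())
# Assumption: the input layer sees all context frames concatenated.
mlp_input_dim = frame_dim * CONTEXT_FRAMES
print(frame_dim, mlp_input_dim)
```

Under that assumption each frame has 38 features and the MLP input layer has 342 units.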

    The boot lexicon was constructed using the pronunciations from the WS96AMDict dictionary and the context dependent durations file from a previous training on the cut0 training utterances. The WS96AMDict is constructed with the PRONLEX phoneset, which has 42 phones, whereas the ICSI phoneset has 56 phones. The following monophone transformation rules were used to transform the WS96AMDict to cover the ICSI56 phoneset. Looking at the transform rules
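Applying such transformation rules to a dictionary can be sketched as below. This is a hedged illustration only: the phone symbols are placeholders (P1, I1, ...) and are NOT the actual PRONLEX or ICSI56 rules, which live in the linked rule tables; the function name and the two-table layout (1-1 mappings plus rewrite rules) are assumptions.

```python
def map_pronunciation(phones, one_to_one, rewrite):
    """Map one pronunciation from a source phone set to a target phone set.

    one_to_one: dict source phone -> target phone
    rewrite:    dict source phone -> list of target phones (may expand to several)
    Raises KeyError for a phone covered by neither table.
    """
    out = []
    for p in phones:
        if p in rewrite:
            out.extend(rewrite[p])
        elif p in one_to_one:
            out.append(one_to_one[p])
        else:
            raise KeyError(f"no mapping rule for phone {p!r}")
    return out

# Placeholder symbols only; these are NOT the actual PRONLEX or ICSI56 rules.
ONE_TO_ONE = {"P1": "I1", "P2": "I2"}
REWRITE = {"P3": ["I3", "I4"]}

mapped = map_pronunciation(["P1", "P3", "P2"], ONE_TO_ONE, REWRITE)
print(mapped)
```

Raising an error on an unmapped phone, rather than passing it through, makes gaps in the rule tables visible before training starts.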

    The training of the net was implemented using the sigmoidx estimation (sigmoid outputs, relative entropy error criterion).

    Training performance on the Cross Validation set

    The net was re-estimated 3 times. The final frame accuracy for monophone recognition, measured on the whole training set, was 60.84%. The best frame accuracy for the cross validation set was 56.75%. This reference shows the accuracy rates for the entire embedded training. Accuracy rates for training
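For readers unfamiliar with the measure, frame accuracy can be sketched as follows. This assumes the usual definition, the percentage of frames where the MLP's top-scoring phone matches the aligned reference label; the page does not define it, and the toy label sequences are invented for illustration.

```python
def frame_accuracy(predicted, reference):
    """Percentage of frames whose predicted phone label matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    if not reference:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)

# Toy per-frame labels, invented for illustration only.
ref = ["k", "k", "ae", "ae", "ae", "t", "t", "t"]
hyp = ["k", "ae", "ae", "ae", "ae", "t", "t", "k"]
acc = frame_accuracy(hyp, ref)
print(f"{acc:.2f}%")
```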

    Final Results

    Several decodes were run on the embedded trained MLP. Besides obtaining a baseline WER, I also investigated performance vs. lattice size and performance vs. number of training iterations. In short, the performance is:

                                        System Trained using     System Trained using
                                        Embedded method          1 iteration
    Decoding with Large HTK Lattice     61.5% (Speaker Stats)    65.4% (Speaker Stats)
    Decoding with Pruned HTK Lattice    61.9% (Speaker Stats)    65.4% (Speaker Stats)

    TABLE 1

    Observations

    There are two items to note with this experiment. First, using a more task dependent lexicon does help. The question that remains: was it from the durations or from using a Switchboard dictionary? It is my belief, based on experiments run by others in the group, that the (minimum) durations have no effect on performance during recognition. One experiment in particular was run to verify this. Chris Ris took a lexicon and removed all the monophone durations for every word in the lexicon. The word error rate was essentially the same for the no-duration lexicon as for a lexicon that contained phone durations. I have no insight into the effects that minimum durations have on the forced alignment process that is done during training. Second, the size of the HTK word lattice did not appreciably affect performance. The pruned word lattice was needed because the original dictionary that others in the group used did not contain all the words in the larger HTK word lattice. Therefore, Chris Ris pruned the HTK word lattice to match the lexicon that was available for his use.

    Booting ICSI system from HTK training set alignments

    The initial plan was to train an HTK system using the rastaplp-12 features (including the delta and delta-delta features) and then use the trained system to get an alignment of the training data. Using this alignment and the embedded trained MLP (trained with the Switchboard lexicon), I would run 1 more MLP iteration on the newly obtained alignments. If the HTK system was that much better, then using the alignments from HTK would be a good bootstrap method.

    It turns out this approach rested on a flawed premise. First, the HTK baseline was not as good as had initially been advertised. Second, the first ICSI baseline system (70% WER) was decoded with the NOWAY decoder (which at this time did not permit word insertion penalties, nor did it take advantage of HTK lattices). Taking the ICSI baseline system and rescoring the HTK word lattice (without the NOWAY decoder) gave a result of 63.6% WER.

    The proper way to have applied this technique would have been to use the HTK generated labels as the truth for the first iteration of training. This operation would replace the need to do an alignment from a boot net like TIMIT.

    After 1 iteration of training on the HTK alignment, the frame accuracy was not as good as one would have liked. The CV rate for the baseline system on the final training iteration was 56.75%, compared to a frame accuracy of 49.27% using the HTK aligned data. Therefore, given the time constraints, this work was not continued.

    What needs to be done? I should run an HTK evaluation on the devtest data to get a performance point, and I should also use the trained MLP and the lattice decoder to see how bad the results really are.

    HTK Configuration using the ICSI 56 phone set and RastaPLP12 Features

    The HTK system was trained using the scripts that were made available for the workshop. The training was done on the ICSI-defined cut0 male/female Switchboard training set, which has 17614 training utterances: 9019 male and 8595 female. New HTK feature files were created using ICSI's rasta plp 12 programs. A feature file was created for each training utterance, containing 12 rasta plp, 12 delta rasta plp, 12 delta-delta rasta plp, delta log energy and delta-delta log energy: 38 features at a 10 msec frame rate. The WS96AMdict transformed to the ICSI 56 phone set was used. The HTK system was trained in a flat start fashion, using word internal trigrams, a HAPRUNE threshold of 10000, a tree cluster threshold of 300, and a mixture growing sequence of 2,4,5,7.

    I have included my README file for the training of the ICSI influenced HTK system. I have also included a reference to the environment file used to start the HTK training scripts.

    The next major change needed for the HTK system to be used in an ICSI influenced configuration was to change how the tree clustering rules were applied. Additional rules were added. This is due to the fact that the dictionary was using the ICSI56 phoneset instead of the PRONLEX phoneset. Barb Wheatly was a great help in getting these rules defined. I have included the tree clustering rules and the rules for state tying.


    Last modified on October 16, 1996
    Christophe Ris <ris@cspjhu.ece.jhu.edu>