An ICSI-based system was built using this feature vector. An MLP was trained for 1 iteration. The training labels came from the baseline system built by ICSI before the workshop, not from the baseline system described in this paper. The MLP had 140 inputs, 1000 hidden units, and 56 outputs, and was trained using sigmoidx. The training material was the ICSI-defined male-cut0 set, and the evaluation was on the 240 male utterances from the dev-data. Unlike other systems, the net started from random initial weights. The usual starting point for an ICSI MLP is a set of weights derived from the TIMIT set; since this feature set was never used on TIMIT, the net had to be started from random weights.
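As a rough illustration of the net described above, the following sketch builds a 140-1000-56 MLP with random starting weights and runs one frame through it. Only the layer sizes and the random start come from the text. The logistic hidden units, the softmax output, the uniform initialization range, and all function names are assumptions made for the example; in particular, this is not a description of what the "sigmoidx" setting does in the ICSI training software.

```python
import math
import random

N_IN, N_HID, N_OUT = 140, 1000, 56  # layer sizes stated in the text


def init_layer(n_in, n_out, rng, scale=0.05):
    """Random weights and zero biases. The uniform range is an assumption."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(weights, biases, x):
    """Compute W x + b for one input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Numerically stable softmax (assumed output nonlinearity)."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def forward(params, x):
    """One forward pass: input -> sigmoid hidden layer -> softmax outputs."""
    (w1, b1), (w2, b2) = params
    hidden = [sigmoid(z) for z in affine(w1, b1, x)]
    return softmax(affine(w2, b2, hidden))


rng = random.Random(0)
params = (init_layer(N_IN, N_HID, rng), init_layer(N_HID, N_OUT, rng))
frame = [rng.gauss(0.0, 1.0) for _ in range(N_IN)]  # stand-in feature frame
posteriors = forward(params, frame)
```

With random starting weights the 56 outputs are close to uniform, which is the situation the net faced here in place of the usual TIMIT-derived starting point.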
Result 1 ... CHAF Baseline
This is the baseline system performance using the CHAF features described above. The frame accuracy on the training set was somewhat promising, but the overall performance was 73.5% WER (Speaker Stats).
Result 2 ... CHAF Baseline + ICSI Baseline no durations
In this configuration, the ICSI baseline system (which gave a 63.6% WER) and the baseline CHAF system were combined by summing the posteriors of the MLP outputs for the 240 male utterances. This system assumed that the lexicon used phone models with no context-dependent durations. The performance of this system was 62.8% WER (Speaker Stats).
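The posterior summing used to combine the two systems can be sketched as follows. This is a minimal illustration, not the workshop code: the function name and the toy three-class posteriors are hypothetical, and the renormalization of each frame after summing is an assumption, since the text only says that the posteriors were summed.

```python
def combine_posteriors(frames_a, frames_b):
    """Sum two posterior streams frame by frame, then renormalize each frame.

    The renormalization step is an assumption; the text states only that
    the MLP output posteriors were summed.
    """
    if len(frames_a) != len(frames_b):
        raise ValueError("streams must have the same number of frames")
    combined = []
    for post_a, post_b in zip(frames_a, frames_b):
        if len(post_a) != len(post_b):
            raise ValueError("frames must have the same number of classes")
        summed = [a + b for a, b in zip(post_a, post_b)]
        total = sum(summed)
        combined.append([s / total for s in summed])
    return combined


# Toy posteriors over 3 classes for 2 frames (the real nets had 56 outputs).
icsi_stream = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
chaf_stream = [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
merged = combine_posteriors(icsi_stream, chaf_stream)
```

Because both streams are already probability distributions per frame, dividing by the sum is equivalent to averaging them; the ranking of classes within a frame is the same with or without that step.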
Result 5, the last and final result, shows that after 1 iteration of training, a 0.6% (absolute) WER reduction can be achieved by integrating the CHAF features with the standard ICSI rastaplp features under 1 MLP. Perhaps the CHAF features do provide some important timing information, but the current implementation does not support this. There is ongoing investigation into features of the same time scale or longer, both as adjuncts to the speech recognition features and as a scientific issue in itself. LDA analyses of these kinds of feature sets yield suggestive statistics indicating that we are capturing structure previously ignored, but these investigations are in the initial stages.
Result 3 ... CHAF Baseline + ICSI Baseline w/ durations
This is the same system as in Result 2, except that the lexicon was built using context-dependent duration phone models. The performance of this system was 62.3% WER (Speaker Stats).
Result 4 ... CHAF + ICSI + 4 Band Multiband
This system combined the ICSI baseline system, the 4-band multi-band baseline system, and the CHAF baseline system. The lexicon did have context-dependent phone duration models. The performance of this system was 61.2% WER (Speaker Stats). This can be compared to the group's best performing system, at 59.4% WER, which combines the 4-band system with the ICSI baseline system.
Result 5 ... ICSI Features and CHAF in same MLP
This result involved putting together a new system. This system used the training software that allows two different streams of data. The first stream was the standard ICSI rastaplp (38 element) features over a 9 frame window. The second stream was the CHAF features (the upper and lower frequency band delta energies) over a 25 frame window. Since the baseline system described in this paper (using a swboard task dependent lexicon) had better performance than the ICSI baseline system, the same swboard task dependent lexicon described in this paper was used in this system. This system was trained for 1 iteration only. The labels from the last iteration of training of the SWBOARD baseline system (as described in this paper) were used. The net had 392 inputs, 1000 hidden units, and 56 outputs, and used sigmoidx. The net started from random weights. The frame accuracy for the cross validation set was 55.08%. The decoding was done on the larger HTK lattice. The performance of the system was 64.8% WER (Speaker Stats). This result can be compared to the 65.4% WER presented in Table 1. Further iterations of training were attempted with this system, but a full embedded training of this system could not be completed.
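The 392 inputs of this net follow from the two stream definitions, as the short calculation below shows. The helper name is hypothetical, and the count of 2 CHAF values per frame is an inference from the description (one upper-band and one lower-band delta energy); it is consistent with the stated total but is not given explicitly in the text.

```python
def stream_inputs(features_per_frame, window_frames):
    """Number of MLP inputs contributed by one stream with a context window."""
    return features_per_frame * window_frames


RASTA_FEATURES = 38   # stated: 38 element rastaplp vector
RASTA_WINDOW = 9      # stated: 9 frame window
CHAF_FEATURES = 2     # assumed: upper and lower band delta energies
CHAF_WINDOW = 25      # stated: 25 frame window

rasta_inputs = stream_inputs(RASTA_FEATURES, RASTA_WINDOW)
chaf_inputs = stream_inputs(CHAF_FEATURES, CHAF_WINDOW)
total_inputs = rasta_inputs + chaf_inputs

print(rasta_inputs, chaf_inputs, total_inputs)  # 342 50 392
```

The rastaplp stream therefore accounts for 342 of the inputs and the CHAF stream for the remaining 50, even though the CHAF window spans nearly three times as many frames.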
Observations
The one thing that stands out with these 5 results is that the CHAF features are not being used correctly. Result 1 shows that these features by themselves did not provide recognition anywhere near our baseline systems (73.5% WER). Results 2 and 3 show that, when combined with the ICSI baseline system (63.6% WER), they brought the word error rate down to 62.8% and 62.3%, a decrease of roughly a percentage point (absolute). However, Result 4 is a good negative result: performance was hurt by using the CHAF features (61.2% WER, against 59.4% WER for the 4-band system combined with the ICSI baseline system alone).
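The comparisons behind these observations can be tabulated with a few lines of arithmetic. The WER figures are those reported in the results above; the dictionary keys and the helper name are hypothetical, and comparing Result 3 against the 63.6% ICSI baseline is an assumption, since the text does not give a separate baseline figure for the lexicon with context dependent durations.

```python
# WER figures (percent) as reported in the results above.
wer = {
    "icsi_baseline": 63.6,
    "result_1_chaf_only": 73.5,
    "result_2_no_durations": 62.8,
    "result_3_with_durations": 62.3,
    "result_4_with_multiband": 61.2,
    "best_multiband_plus_icsi": 59.4,
    "result_5_same_mlp": 64.8,
    "table_1_reference": 65.4,
}


def absolute_change(new, reference):
    """Absolute WER change in percentage points (negative means improvement)."""
    return round(wer[new] - wer[reference], 1)


comparisons = {
    "result 2 vs ICSI baseline": absolute_change(
        "result_2_no_durations", "icsi_baseline"),
    # Assumption: the same 63.6% figure is used as the reference here.
    "result 3 vs ICSI baseline": absolute_change(
        "result_3_with_durations", "icsi_baseline"),
    "result 4 vs best system": absolute_change(
        "result_4_with_multiband", "best_multiband_plus_icsi"),
    "result 5 vs Table 1": absolute_change(
        "result_5_same_mlp", "table_1_reference"),
}

for name, delta in comparisons.items():
    print(f"{name}: {delta:+.1f} points")
```

Under these pairings the combinations in Results 2, 3, and 5 improve on their references by 0.8, 1.3, and 0.6 points, while adding CHAF in Result 4 costs 1.8 points relative to the best system.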
Last modified on October 16, 1996