Pronunciation Group Meeting 23.6.97 (Sanjeev Khudanpur)
1. Bill Byrne, Sanjeev Khudanpur, Mike Riley, Murat Saraclar and
Chuck Wooters were present.
2. Murat reported some ongoing work in preparation for building
the vanilla system for the workshop. The tree building stuff
definitely works.
(a) ICSI transcriptions of approximately two hours were used to build
decision trees for the phoneme-to-phone mapping. They were tested
on a 32 minute portion of the WS96 dev-test which has also been
transcribed by ICSI.
(b) A second set of trees was built by mixing a subset of TIMIT with
the ICSI data. These trees were also tested on the 32 minute set
mentioned above.
(c&d) A third set of trees, grown with only TIMIT training data, was
built. These trees were tested on the 32 minute ICSI test set as
well as a test set of comparable size drawn from TIMIT.
Without smoothing the tree-probabilities, the test set contains
unseen events which get a zero probability. Rather than smoothing
the trees, the worst 10% of tokens in the test set were discarded in
reporting the following test results.
Source of training         ICSI     I+T      TIMIT    TIMIT
Source of test set         ICSI     ICSI     ICSI     TIMIT
Perplexity at root (bits)  0.9109   0.8749   0.9084   0.3402
Perplexity of tree (bits)  0.6827   0.6099   0.7406   0.1524
Efficiency I(X;Y)/H(X)     25.05%   30.28%   18.48%   55.19%
Note that while the TIMIT/TIMIT trees reduce the "unigram"
perplexity by a little more than half (0.3402 to 0.1524 bits), the
same doesn't hold for the ICSI/ICSI set, where the reduction is
only about a quarter.
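The efficiency row can be recomputed from the two rows above it. The following is a minimal Python sketch, assuming that the "root" figure is the entropy H(X) of the phone given only the phoneme, that the "tree" figure is the conditional entropy H(X|Y) at the leaves, and hence that I(X;Y) = H(X) - H(X|Y). The function names are illustrative and are not part of any tool used by the group. The second function sketches one possible reading of "discarding the worst 10% of tokens": drop the tenth of the test tokens with the lowest probability (zero-probability tokens first) before averaging.

```python
import math

def efficiency(root_bits, tree_bits):
    """Fraction of the root entropy removed by the tree: I(X;Y) / H(X)."""
    return (root_bits - tree_bits) / root_bits

def trimmed_average_bits(token_probs, drop_fraction=0.10):
    """Average -log2(p) after dropping the lowest-probability tokens.

    Assumption: "worst" means lowest probability under the tree, so
    zero-probability (unseen) tokens are the first to be dropped.
    """
    surprisals = sorted(-math.log2(p) if p > 0 else math.inf for p in token_probs)
    n_drop = int(drop_fraction * len(surprisals))
    kept = surprisals[:len(surprisals) - n_drop]
    return sum(kept) / len(kept)

# (train, test) -> (root bits, tree bits), copied from the table above.
table = {
    ("ICSI", "ICSI"): (0.9109, 0.6827),
    ("I+T", "ICSI"): (0.8749, 0.6099),
    ("TIMIT", "ICSI"): (0.9084, 0.7406),
    ("TIMIT", "TIMIT"): (0.3402, 0.1524),
}
for (train, test), (root, tree) in table.items():
    print(f"{train}/{test}: {100 * efficiency(root, tree):.2f}%")
```

Recomputed from the rounded table entries, the values agree with the reported efficiencies to within about 0.02 percentage points, which is what one would expect from rounding. Note that if more than 10% of the tokens are unseen, the trimmed average is still infinite.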
3. While on the subject of the 32 minute ICSI test set, we mention
that the sentences in there are a shade harder than the training
set with regard to prediction! The test set is about one fourth
the training set in size. Randomly dividing the 2-hour ICSI
training set into roughly four equal parts, Murat built five sets
of "root-only" trees, one from each fourth, and one from the test
set. These trees were then tested on each other's training data
sets.
Train ->   1      2      3      4      test
Test
  1        0.69   0.71   0.71   0.71   0.78
  2        0.70   0.71   0.71   0.71   0.79
  3        0.69   0.72   0.71   0.71   0.79
  4        0.70   0.72   0.71   0.71   0.79
  test     0.92   0.91   0.91   0.91   0.87
At this point, we conjecture that the differences are due to the
fact that the 32-minute set comes from last year's labeling effort
while most of the training material was labeled this year using
slightly different conventions!
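The cross-testing above can be sketched as follows. This is a minimal Python illustration, assuming that a "root-only" tree amounts to a relative-frequency estimate of P(phone | phoneme) with no context questions, and that each table entry is the average of -log2 P over the test tokens. The partition names and the toy (phoneme, phone) pairs are hypothetical and are not drawn from the ICSI data.

```python
import math
from collections import Counter, defaultdict

def train_root_only(pairs):
    """Estimate P(phone | phoneme) by relative frequency, with no context."""
    counts = defaultdict(Counter)
    for phoneme, phone in pairs:
        counts[phoneme][phone] += 1
    return {
        phoneme: {phone: n / sum(c.values()) for phone, n in c.items()}
        for phoneme, c in counts.items()
    }

def average_bits(model, pairs):
    """Average -log2 P(phone | phoneme); infinite if any pair is unseen."""
    total = 0.0
    for phoneme, phone in pairs:
        p = model.get(phoneme, {}).get(phone, 0.0)
        if p == 0.0:
            return math.inf
        total -= math.log2(p)
    return total / len(pairs)

def cross_test(partitions):
    """Return {(test_name, train_name): average bits} for every pairing."""
    models = {name: train_root_only(data) for name, data in partitions.items()}
    return {
        (test, train): average_bits(models[train], partitions[test])
        for test in partitions
        for train in partitions
    }

# Hypothetical toy partitions of (phoneme, phone) pairs.
partitions = {
    "a": [("t", "t"), ("t", "dx"), ("ae", "ae"), ("ae", "ae")],
    "b": [("t", "t"), ("t", "t"), ("t", "dx"), ("ae", "ae")],
}
results = cross_test(partitions)
for (test, train), bits in sorted(results.items()):
    print(f"test={test} train={train}: {bits:.2f} bits")
```

In this sketch an unseen (phoneme, phone) pair makes the whole entry infinite; the notes do not say how unseen events were handled in this particular experiment.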
4. There was also some discussion on whether entropy-like measures
were appropriate for checking the goodness of the decision trees
for our purposes. This, it was felt, should be resolved soon so
that any benefits of adding new features to the trees, or other
modifications, can be examined in a manner short of a recognition
experiment. The discussion shall be resumed when we meet in July.
5. Mike reported that the FSA tools for converting a (training) word
`sequence'+trees into a phone network seem to be working well.
The FSA tools for converting a (test) word `lattice'+trees into a
phone graph are almost working - Mike is weeding out the last of
the bugs.
6. Bill reported his ongoing effort with vocal tract length (VTL)
normalization and MLLR-based adaptation. The gains from these two
do not seem to be as significant in our HTK-based system as some
other sites (e.g. BBN) have reported in the last Hub5 evaluation.
The following are the results on the WS96 dev-test.
Baseline system with MF-PLP    54.19% word accuracy
VTL on test only               55.25%
VTL train+test                 55.66%
VTL + one pass of MLLR         56.04%
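To put the size of these gains in perspective, the short Python sketch below computes the absolute accuracy gain of each condition over the baseline and the corresponding relative reduction in 1-best word error rate. It assumes WER = 100 - word accuracy, which the notes do not state explicitly; the function name is illustrative.

```python
# Word accuracies (%) copied from the results above.
results = [
    ("Baseline MF-PLP", 54.19),
    ("VTL on test only", 55.25),
    ("VTL train+test", 55.66),
    ("VTL + one pass of MLLR", 56.04),
]

def gains_over_baseline(results):
    """Return (name, absolute accuracy gain, relative WER reduction in %).

    Assumption: WER = 100 - word accuracy.
    """
    _, base_acc = results[0]
    base_wer = 100.0 - base_acc
    rows = []
    for name, acc in results[1:]:
        absolute = acc - base_acc
        relative = 100.0 * absolute / base_wer
        rows.append((name, absolute, relative))
    return rows

for name, absolute, relative in gains_over_baseline(results):
    print(f"{name}: +{absolute:.2f} absolute, {relative:.1f}% relative WER reduction")
```

Under that assumption, the full VTL + MLLR chain gains 1.85 points absolute, or roughly 4% relative, over the baseline.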
The VTL normalization was done using the warping that gave the
maximum likelihood to the output of the first decoding pass. A
second decoding pass was then performed with the warped data, and
MLLR was performed on the output of this second pass. Even though
these gains are not very large, and it would be good to know why,
we expect that this should be enough of an improvement to reduce
the lattice WER substantially.
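The warp selection step described above can be sketched as a simple grid search. This is only an illustration in Python: the grid range and step, and the scoring function, are assumptions, since the notes do not say which warp factors were tried or how the likelihood was computed in the HTK-based system. In practice the score would come from aligning the warped features to the first-pass transcript; here a toy function stands in for it.

```python
def select_warp(score, warps):
    """Return the warp factor with the highest score (log-likelihood)."""
    return max(warps, key=score)

# Hypothetical search grid; the notes do not give the range or step used.
warp_grid = [round(0.88 + 0.02 * i, 2) for i in range(13)]

def toy_log_likelihood(warp):
    """Stand-in for the likelihood of the first-pass output under this warp."""
    return -(warp - 1.04) ** 2

best_warp = select_warp(toy_log_likelihood, warp_grid)
print(f"selected warp factor: {best_warp}")
```

The selected warp would then be applied to the data before the second decoding pass, with MLLR run on that pass's output.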
If you have managed to read all the way down to this line, you are one
tenacious person :)
Cheers,
- Sanjeev
Harriet Nock
Last modified: Sat Jul 5 14:58:29 EDT 1997