Automatic Speech Recognition with Word-Level Models
This lab focuses on building a speaker-independent ASR system for
recognizing strings of spoken numbers. The vocabulary of the task is
therefore small: 11 words, including "oh" (an alternative pronunciation
of zero) and "sil" (silence). The notes below refer to an ExptDir, a
template for your work which can be found at
/export/arnab/work/summer-school06/template. The subdirectories
ExptDir/train/setup and ExptDir/test/setup contain the framework needed
to replicate or build upon the baseline system.
All the programs below should be executed on an xYY machine, where
YY is between 13 and 36.
ssh -X xYY
Copy a template to start your own experimentation:
ExptDir=/export/arnab/work/summer-school06/template
cd YourExptDir
cp -r $ExptDir .
Note: If your shell is tcsh (echo $SHELL), then prefix all variable
assignments with "set", as in:
set ExptDir=/export/zak/macrophone/lab/sys1
Corpus  
Model Parameters  
Training the System  
Evaluating the System
Corpus
All experiments in this lab use the MacroPhone corpus from the LDC.
For training, we use a subset of about 8k utterances from a large
number of speakers, each speaker contributing a few utterances. Each
utterance is a string of numbers spoken in one of various contexts.
The test data comprises three sets with different string lengths,
namely 7, 10 and 12 digits.
ExptDir=/export/arnab/work/summer-school06/
The scripts below need the following inputs from the corpus.
ExptDir/setup/trn.utt.list - list of input feature vectors for the
training utterances.
ExptDir/setup/htk.cfg - HTK configuration file for reading the above
features.
ExptDir/setup/trn.word.mlf - word strings associated with the above
utterances.
The files associated with the test sets can be found in
/export/arnab/work/summer-school06/setup.
dev.7.list - list of input feature vectors in the 7-digit test set.
dev.10.list - list of input feature vectors in the 10-digit test set.
dev.12.list - list of input feature vectors in the 12-digit test set.
htk.cfg - HTK configuration file for reading the above features.
devtst.word.mlf - word strings associated with the above utterances.
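Before training, it can help to check that a list file and its MLF describe the same number of utterances. The sketch below is one way to do that, assuming the list holds one feature file per line and the MLF follows the usual HTK layout (a `#!MLF!#` header, then for each utterance a quoted label name, one word per line, and a closing `.`). The function names, the demo file names, and the demo words are placeholders, not corpus data.

```shell
#!/bin/sh
# Count utterances in an HTK script file (assumed: one feature file per line)
# and in a word-level MLF (each utterance starts with a quoted label name).
count_list_utts() { grep -c . "$1"; }
count_mlf_utts()  { grep -c '^"' "$1"; }

# Tiny made-up example; file names and words are placeholders, not corpus data.
tmp=$(mktemp -d)
printf '%s\n' /data/utt1.mfc /data/utt2.mfc > "$tmp/demo.list"
cat > "$tmp/demo.mlf" <<'EOF'
#!MLF!#
"*/utt1.lab"
one
oh
.
"*/utt2.lab"
seven
.
EOF

n_list=$(count_list_utts "$tmp/demo.list")
n_mlf=$(count_mlf_utts "$tmp/demo.mlf")
if [ "$n_list" -eq "$n_mlf" ]; then
    echo "OK: $n_list utterances in both files"
else
    echo "Mismatch: list=$n_list mlf=$n_mlf"
fi
rm -r "$tmp"
```

To check the real inputs, call the two functions on setup/trn.utt.list and setup/trn.word.mlf instead of the demo files.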
Model Parameters
The most important parameter controlling the behavior of the ASR
system is the number of states used for each model (each digit, in this
case). This is set in the HMM topology file, ExptDir/setup/hmmtop. All
experiments in this lab assume that the HMMs have a left-to-right
topology. In principle, by modifying ExptDir/scripts/clonehmm.pl, it
should also be possible to have more complex topologies which allow
states to be skipped. ExptDir/setup/hmmproto is a prototype of the model
and contains information such as the type and length of the observation
vector, the type of covariance and the kind of feature vector to expect.
ExptDir/setup/wrdlist contains the list of digits for which models need
to be trained.
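To make the left-to-right topology concrete, the sketch below prints a transition matrix in the style HTK uses inside a model definition, with a non-emitting entry state and exit state around the emitting states. The function name, the three-state example, and the 0.6 self-loop value are assumptions chosen for illustration; the actual format of hmmtop is not shown in these notes and may differ.

```shell
#!/bin/sh
# Print an HTK-style transition matrix for a left-to-right HMM.
# $1: number of emitting states (hypothetical value, for illustration only).
# $2: initial self-loop probability (assumed default 0.6).
# HTK adds a non-emitting entry and exit state, so the matrix is (n+2) square.
left_to_right_transp() {
    n_emit=$1
    self=${2:-0.6}
    awk -v n="$n_emit" -v self="$self" 'BEGIN {
        total = n + 2
        printf "<TransP> %d\n", total
        for (i = 1; i <= total; i++) {
            line = ""
            for (j = 1; j <= total; j++) {
                p = 0.0
                if (i == 1 && j == 2)
                    p = 1.0              # entry state moves to first emitting state
                else if (i > 1 && i < total && j == i)
                    p = self             # stay in the same state
                else if (i > 1 && i < total && j == i + 1)
                    p = 1 - self         # advance to the next state
                line = line sprintf(" %.1f", p)
            }
            print line                   # the exit state row stays all zeros
        }
    }'
}

left_to_right_transp 3
```

Every row has nonzero entries only on the diagonal and one step to its right, which is what left-to-right means here. A topology that allows skips would also place a nonzero entry two columns right of the diagonal.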
Training the System
ASR systems are usually trained by increasing the complexity of the
model in steps. The main script, ExptDir/setup/mlTrainLocal.pl, performs the
training using the following steps.
- InitializeModels: Compute a global mean and variance over all
features and copy that to all states of all models.
- EMTrain: Apply four iterations of the EM algorithm to improve the
models, using strings with no inter-word silences.
- EMTrain: Apply four iterations of the EM algorithm to improve the
models, using strings with inter-word silences.
- ViterbiAlign: Allow the latest model to pick and choose which
inter-word silences to keep and which to ignore.
- EMTrain: Apply four iterations of the EM algorithm to improve the
models, using strings with the new alignment.
- Mixture Splitting: Increase the number of components in the
Gaussian mixture model as per the Schedule.
- EMTrain: Re-estimate the components of the Gaussian mixture
models using the EM algorithm.
Note: The last two steps are commented out of the script.
One could attempt several variants of this procedure, with more
ViterbiAligns, fewer EMTrains, or a more gradual or faster mixture
splitting schedule. Any optimization of this procedure can only be
carried out empirically.
cd YourExptDir/train
./setup/mlTrainLocal.pl YourExptDir/train
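As a rough picture of what a single EMTrain step involves, the sketch below chains four passes of HERest (the HTK embedded re-estimation tool), each reading the previous model set and writing the next. It only prints the commands unless DRY_RUN=0 is set. The directory layout (hmm0 to hmm4 under a stage directory named CI-3, borrowed from the model path used in the evaluation commands) and the choice of input files are assumptions; mlTrainLocal.pl may organise its work differently.

```shell
#!/bin/sh
# Dry-run sketch of one EMTrain stage: four HERest passes, each reading the
# previous model set and writing the next (hmm0 -> hmm1 -> ... -> hmm4).
# All paths below are assumptions; mlTrainLocal.pl may organise things differently.
cfg=setup/htk.cfg
scp=setup/trn.utt.list
mlf=setup/trn.word.mlf
hmmlist=setup/wrdlist
stage=CI-3            # assumed stage directory name
n_iter=4

run() {
    # Print the command instead of executing it; set DRY_RUN=0 to execute.
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$@"; else "$@"; fi
}

em_train() {
    i=0
    while [ "$i" -lt "$n_iter" ]; do
        next=$((i + 1))
        run mkdir -p "$stage/hmm$next"
        run HERest -C "$cfg" -I "$mlf" -S "$scp" \
            -H "$stage/hmm$i/MMF" -M "$stage/hmm$next" "$hmmlist"
        i=$next
    done
}

em_train
```

Changing n_iter is the simplest way to try the "fewer EMTrains" variant mentioned above.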
Evaluating the System
To evaluate the system, you need to define a grammar: the space of all
hypotheses, possibly with costs or probabilities associated with each
hypothesis. In this task, any digit can follow any digit, so the set
of hypotheses is defined by an open-loop grammar with no cost,
/export/arnab/work/summer-school06/setup/wdnet. In addition to the
parameters that you have already used above, the decoder can optionally
use a word insertion penalty, which reduces the tendency of the ASR
system to spew out spurious words.
cd YourExptDir/test
mkdir results
mmf=YourExptDir/train/CI-3/hmm4/MMF
odir=YourExptDir/test/results
wip=-60
./setup/test.sh $mmf $wip $odir
./setup/eval.sh $odir
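Since the best word insertion penalty has to be found empirically, a small sweep is a natural next step. The sketch below reuses test.sh and eval.sh with the same argument order as the commands above, and prints the commands unless DRY_RUN=0 is set. The penalty values and the results/wip_<value> directory naming are illustrative assumptions.

```shell
#!/bin/sh
# Sweep the word insertion penalty, decoding and scoring once per value.
# Run from YourExptDir/test. The penalty values and the results/wip_<value>
# layout are illustrative assumptions.
mmf=${mmf:-YourExptDir/train/CI-3/hmm4/MMF}

run() {
    # Print the command instead of executing it; set DRY_RUN=0 to execute.
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$@"; else "$@"; fi
}

sweep_wip() {
    for wip in -20 -40 -60 -80; do
        odir="results/wip_$wip"
        run mkdir -p "$odir"
        run ./setup/test.sh "$mmf" "$wip" "$odir"
        run ./setup/eval.sh "$odir"
    done
}

sweep_wip
```

Keeping one output directory per penalty value makes it easy to compare the scores afterwards.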
To change your grammar, you will need to use HParse. See file:/export/ears/common/src/htk/HTKBook/htkbook/node156_mn.html for a description.
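As a starting point, the sketch below writes a grammar in HParse notation for an unconstrained digit loop and compiles it only if HParse is on the path. In this notation `|` separates alternatives and `< >` means one or more repetitions. The word names, the use of sil at both ends, and the file names digits.gram and wdnet.new are assumptions; the words should match the entries in setup/wrdlist, and the baseline wdnet may be built differently.

```shell
#!/bin/sh
# Write an HParse grammar for an unconstrained digit loop.
# The word names are assumptions; use the entries in setup/wrdlist.
gram=${gram:-digits.gram}
cat > "$gram" <<'EOF'
$digit = one | two | three | four | five | six | seven | eight | nine | zero | oh;
( sil < $digit > sil )
EOF

# Compile to a word network when HParse is available: HParse <grammar> <network>.
if command -v HParse >/dev/null 2>&1; then
    HParse "$gram" wdnet.new
else
    echo "HParse not found; grammar written to $gram"
fi
```

A grammar for a fixed string length would replace the loop with the required number of $digit entries in sequence, for example seven of them for the 7-digit test set.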