Automatic Speech Recognition with Word-Level Models

This lab focuses on building a speaker-independent ASR system for recognizing strings of spoken numbers. The vocabulary of the task is therefore 11 words, including "oh" (an alternative pronunciation of zero) and "sil". The notes below refer to ExptDir, a template for your work, which can be found at /export/zak/macrophone/lab/sys1. The subdirectories ExptDir/setup and ExptDir/scripts contain the framework needed to replicate or build upon the baseline system.

All the programs below should be executed on an xYY machine, where YY is 01 through 36.
ssh -X xYY

Copy a template to start your own experimentation:
ExptDir=/export/zak/macrophone/lab/sys1
cd YourExptDir
cp -r $ExptDir/setup .
cp -r $ExptDir/scripts .
TstDir=/export/zak/macrophone/lab/test
cd YourTestDir
cp $TstDir/* .

Note: If your shell is tcsh (echo $SHELL), then prefix all variable assignments with "set", as in:
set ExptDir=/export/zak/macrophone/lab/sys1

Corpus   Model Parameters   Training the System   Evaluating the System


Corpus

All experiments in this lab use the MacroPhone corpus from LDC. For training, we use a subset of about 8k utterances from a large number of speakers, each contributing a few utterances. Each utterance is a string of numbers spoken in one of various contexts. The test data comprises three sets with different string lengths, namely 7, 10 and 12 digits.

The scripts below will need the following inputs from the corpus.
ExptDir/setup/trn.utt.list - list of input feature vector files for the training utterances.
ExptDir/setup/htk.cfg - HTK configuration file to read the above features.
ExptDir/setup/trn.word.mlf - word strings associated with the above utterances.

The files associated with the test sets can be found in /export/zak/macrophone/lab/test.
dev.7.list - list of input feature vectors in the 7-digit test set.
dev.10.list - list of input feature vectors in the 10-digit test set.
dev.12.list - list of input feature vectors in the 12-digit test set.
htk.cfg - HTK configuration file to read the above features.
devtst.word.mlf - word strings associated with the above utterances.
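
Before running anything, it can help to confirm that your copies of these files are in place. The following is a minimal sketch; check_inputs is a hypothetical helper and is not part of the lab scripts.

```shell
# check_inputs DIR FILE...: report every listed file that is missing or empty.
# Hypothetical helper for illustration; not part of the lab scripts.
check_inputs() {
    dir=$1; shift
    missing=0
    for f in "$@"; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if nothing was missing
    [ "$missing" -eq 0 ]
}
```

For example, "check_inputs YourExptDir/setup trn.utt.list htk.cfg trn.word.mlf" checks the training inputs, and "check_inputs YourTestDir dev.7.list dev.10.list dev.12.list htk.cfg devtst.word.mlf" checks the test inputs.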


Model Parameters

The most important parameter controlling the behavior of the ASR system is the number of states used for each model (each digit, in this case). This is set in the HMM topology file, ExptDir/setup/hmmtop. All experiments in this lab assume that the HMMs have a left-to-right topology. In principle, by modifying ExptDir/scripts/clonehmm.pl, it should also be possible to build more complex topologies that allow states to be skipped. ExptDir/setup/hmmproto is a prototype of the model and specifies the type and length of the observation vector, the type of covariance and the kind of feature vector to expect. ExptDir/setup/wrdlist contains the list of digits for which models need to be trained.
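
To make "left-to-right" concrete, the sketch below prints the transition matrix of such a model: each emitting state may only stay where it is or advance to the next state. It assumes the common HTK convention of a non-emitting entry state and a non-emitting exit state, so N counts those two as well. The function name left_to_right and the self-loop probability are illustrative assumptions; this is not the format of hmmtop and not what clonehmm.pl produces.

```shell
# left_to_right N [P]: print an N x N left-to-right transition matrix.
# N includes the non-emitting entry and exit states (N >= 3 assumed).
# P is the self-loop probability of every emitting state (default 0.6).
left_to_right() {
    awk -v n="$1" -v p="${2:-0.6}" 'BEGIN {
        for (i = 1; i <= n; i++) {
            row = ""
            for (j = 1; j <= n; j++) {
                v = 0
                if (i == 1 && j == 2) v = 1                        # entry state moves to the first emitting state
                else if (i > 1 && i < n && j == i) v = p           # emitting state: stay
                else if (i > 1 && i < n && j == i + 1) v = 1 - p   # emitting state: advance
                row = row sprintf("%s%.2f", (j > 1 ? " " : ""), v)
            }
            print row   # the exit state (last row) has no outgoing transitions
        }
    }'
}
```

A topology that allows states to be skipped would additionally place non-zero entries two columns to the right of the diagonal.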

Training the System

ASR systems are usually trained by increasing the complexity of the model in steps. The main script, ExptDir/scripts/mkHMMs.sh, performs the training in this stepwise fashion.
One could attempt several variants of this procedure, with more ViterbiAligns, fewer EMTrains, or a more gradual or faster mixture-splitting schedule. Any such optimization can only be carried out empirically. The script requires an input file, an example of which can be found in ExptDir/scripts/expt.in. Many of the parameters in the input file have been described above. In addition, the variables nProcs (number of processes) and qsubHdr (array resource request) define the parallel computing environment needed for training. The variable nEM sets the number of EM iterations carried out each time EMTrain is invoked.

cd YourExptDir/scripts

Remember to edit edir in expt.in to point to YourExptDir.

./mkHMMs.sh ./expt.in &> ./mkHMMs.log
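
When comparing mixture-splitting schedules, it can be useful to write out how many EM iterations each one implies. The sketch below only prints such a plan; it does not train anything. The function plan_training and the stage ordering are assumptions made for illustration and do not describe what mkHMMs.sh actually does. ViterbiAlign is left out because the text above does not say where it runs.

```shell
# plan_training NEM MIX...: print a staged schedule, one line per step.
# NEM is the number of EM iterations per stage; each MIX is the number of
# mixture components per state at that stage. Illustration only.
plan_training() {
    nEM=$1; shift
    for mix in "$@"; do
        echo "$mix mixture component(s) per state"
        i=1
        while [ "$i" -le "$nEM" ]; do
            echo "  EMTrain iteration $i of $nEM"
            i=$((i + 1))
        done
    done
}
```

For instance, "plan_training 4 1 2 4" lists three stages of four EM iterations each, whereas "plan_training 4 1 4" reaches the same final model size in two stages.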


Evaluating the System

To evaluate the system, you need to define a grammar: the space of all hypotheses, possibly with costs or probabilities associated with each hypothesis. In this task, any digit can follow any digit, so the set of hypotheses is defined by an open-loop grammar with no costs, TstDir/wdnet. In addition to the parameters you have already used above, the decoder can optionally use a word insertion penalty, which reduces the tendency of the ASR system to output spurious words. See the *.res files to check the results. Using the template, you should be able to obtain a word error rate in the range of 2-4%.
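
For reference, word error rate is conventionally computed as the number of substitutions, deletions and insertions divided by the number of words in the reference transcripts. The sketch below applies that standard formula to counts you supply. The function name wer is hypothetical, and nothing is assumed here about the layout of the *.res files.

```shell
# wer SUBS DELS INS NREF: print the word error rate as a percentage.
# NREF is the total number of words in the reference transcripts.
wer() {
    awk -v s="$1" -v d="$2" -v i="$3" -v n="$4" \
        'BEGIN { printf "%.2f\n", 100 * (s + d + i) / n }'
}
```

With 3 substitutions, 1 deletion and 2 insertions against 200 reference words, "wer 3 1 2 200" prints 3.00. Because insertions count as errors, a system that outputs many spurious words is penalized even when every reference word is recognized correctly.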

cd YourExptDir
mkdir results
d=YourExptDir/setup
dict=$d/dict
mlist=$d/wrdlist
mmf=YourExptDir/CI-6-Mix/hmm4/MMF
odir=YourExptDir/results
wip=-60
YourTestDir/test.sh $dict $mlist $mmf $wip $odir
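
Since the best word insertion penalty can only be found empirically, one option is to repeat the test over several values. The sketch below assumes the test script takes its arguments in the order shown above (dict, mlist, mmf, wip, odir). The function sweep_wip and the per-value output directory naming are hypothetical choices, not part of the lab scripts.

```shell
# sweep_wip TEST_CMD DICT MLIST MMF ODIR WIP...: run the test script once per
# word insertion penalty, writing each run to its own subdirectory of ODIR.
# The body runs in a subshell so that it does not overwrite variables such as
# dict or wip in the calling shell.
sweep_wip() (
    test_cmd=$1; dict=$2; mlist=$3; mmf=$4; odir=$5
    shift 5
    for wip in "$@"; do
        out="$odir/wip$wip"      # e.g. results/wip-60
        mkdir -p "$out"
        "$test_cmd" "$dict" "$mlist" "$mmf" "$wip" "$out"
    done
)
```

For example, "sweep_wip YourTestDir/test.sh $dict $mlist $mmf $odir -20 -40 -60 -80" would leave one set of results under each of results/wip-20, results/wip-40, and so on, which can then be compared.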