Automatic Speech Recognition with Phone-Level Models
This lab focuses on building a speaker-independent ASR system for
recognizing strings of spoken numbers. The vocabulary of the task is
therefore 11 words, as in the previous system. The template for this
system can be found in /export/zak/macrophone/lab/sys2. Again, the
subdirectories ExptDir/setup and ExptDir/scripts contain the
necessary files to replicate or build upon the baseline system.
Copy the template to start your own experiments:
ExptDir=/export/zak/macrophone/lab/sys2
cd YourExptDir
cp -r $ExptDir/setup .
cp -r $ExptDir/scripts .
TstDir=/export/zak/macrophone/lab/test
cd YourTestDir
cp $TstDir/* .
Note: If your shell is tcsh (echo $SHELL), then prefix all variable
assignments with "set", as in:
set ExptDir=/export/zak/macrophone/lab/sys2
Corpus  
Model Parameters  
Training the System  
Evaluating the System
Corpus
The corpus is the same as in the previous system; see the description given there.
Model Parameters
In this system, the acoustic models are at the phone level. Again, the
design decisions include picking the number of states for each
model. This is controlled through the HMM topology file,
ExptDir/setup/hmmtop. Again, a left-to-right HMM topology is
assumed. The file ExptDir/setup/hmmproto is a prototype of the model and
contains information such as the type and length of the observation
vector, the type of covariance and the kind of feature vector to
expect. The file ExptDir/setup/phnlist contains the list of phonemes in
the dictionary.
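To make the left-to-right assumption concrete, the sketch below prints a small transition matrix in which each state can only stay where it is or advance to the next state. It is an illustration only: the function name left_to_right, the self-loop probability of 0.6 and the plain-text layout are assumptions, and the actual format of ExptDir/setup/hmmtop may be quite different. The final row is left as a pure self-loop here; a real toolkit would normally add an exit transition.

```shell
#!/bin/sh
# Hypothetical helper: print an N x N left-to-right transition matrix.
# $1 = number of states, $2 = self-loop probability (both illustrative).
left_to_right() {
    awk -v n="$1" -v p="$2" 'BEGIN {
        for (i = 1; i <= n; i++) {
            row = ""
            for (j = 1; j <= n; j++) {
                if (i == n && j == n)      a = 1.0      # last state: self-loop only
                else if (j == i)           a = p        # stay in the same state
                else if (j == i + 1)       a = 1 - p    # advance to the next state
                else                       a = 0.0      # no skips, no backward jumps
                row = row sprintf("%s%.1f", (j > 1 ? " " : ""), a)
            }
            print row
        }
    }'
}

topo=$(left_to_right 3 0.6)
echo "$topo"
```

Changing the first argument changes the number of states per model, which is the design decision discussed above.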
In addition, the system designer needs to decide which phone set to
use to expand the words into phonemes in the dictionary. For an
example, see ExptDir/setup/dict. The one big advantage of using
phone-level models is that the recognition system can decode new words
that are included in the test dictionary, without having seen those
specific words in the training data.
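Decoding new words only works if every phone in their pronunciations has a trained model, so it can help to check a dictionary against the phone list. The following is a minimal sketch, assuming a phone list with one phone per line and dictionary lines of the form "WORD phone phone ..."; the actual formats of ExptDir/setup/dict and ExptDir/setup/phnlist may differ, and the function name missing_phones and the sample entries are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list phones used in a dictionary but absent from the phone list.
# Assumed formats (not confirmed by the lab files):
#   phone list: one phone per line
#   dictionary: WORD followed by whitespace-separated phones
missing_phones() {
    # $1 = phone list, $2 = dictionary
    awk 'NR == FNR { known[$1] = 1; next }          # first file: remember known phones
         { for (i = 2; i <= NF; i++)                 # second file: skip the word itself
               if (!($i in known)) print $i }' "$1" "$2" | sort -u
}

# Toy data for illustration only
tmp=$(mktemp -d)
printf '%s\n' w ah n t uw sil > "$tmp/phnlist"
printf '%s\n' 'ONE w ah n' 'TWO t uw' 'THREE th r iy' > "$tmp/dict"

missing=$(missing_phones "$tmp/phnlist" "$tmp/dict")
echo "Phones without a model: $(echo $missing)"
rm -rf "$tmp"
```

An empty result would mean that every word in the dictionary can be built from the available phone models.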
Training the System
Here again, we follow the same recipe as in the previous system,
increasing the complexity of the system gradually in steps, as in
ExptDir/scripts/mkHMMs.sh.
- InitializeModels: Compute a global mean and variance over all
features and copy that to all states of all models.
- EMTrain: Apply four iterations of the EM algorithm to improve the
models, using strings with no inter-word silences.
- EMTrain: Apply four iterations of the EM algorithm to improve the
models, using strings with inter-word silences.
- ViterbiAlign: Allow the latest model to pick and choose which
inter-word silences to keep and which to ignore.
- EMTrain: Apply four iterations of the EM algorithm to improve the
models, using strings with the new alignment.
- Mixture Splitting: Increase the number of components in the
Gaussian mixture model as per the Schedule.
- EMTrain: Re-estimate the components of the Gaussian mixture
models using the EM algorithm.
Other variants of this recipe may produce better results than the one
given in the template. Any optimization of this procedure can only be
carried out empirically.
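The InitializeModels step amounts to computing one mean and one variance per feature dimension over all training frames. As a rough illustration only (the lab scripts presumably use their own tools on their own feature files), the sketch below does this with awk on a text file holding one frame per line. The file layout, the toy numbers and the function name global_stats are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: per-dimension global mean and variance of text features.
# Assumed input: one frame per line, whitespace-separated feature values.
global_stats() {
    awk '{ n++
           for (i = 1; i <= NF; i++) { s[i] += $i; ss[i] += $i * $i }
           if (NF > d) d = NF }
         END { for (i = 1; i <= d; i++) {
                   m = s[i] / n
                   # variance = E[x^2] - (E[x])^2
                   printf "dim %d mean %.3f var %.3f\n", i, m, ss[i] / n - m * m } }' "$1"
}

# Toy data: three frames of two-dimensional features
tmp=$(mktemp)
printf '%s\n' '1.0 2.0' '3.0 2.0' '5.0 8.0' > "$tmp"
stats=$(global_stats "$tmp")
echo "$stats"
rm -f "$tmp"
```

Copying these values to every state of every model gives a flat start, which the subsequent EM iterations then refine.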
The main script, ExptDir/scripts/mkHMMs.sh, requires an input file, an
example of which can be found in ExptDir/scripts/expt.in. Many of the
parameters in the input file have been described above. In addition,
the variables nProcs (number of processes) and qsubHdr (array resource
request) define the parallel computing environment needed for
training. The variable nEM sets the number of iterations of EM carried
out each time EMTrain is invoked.
cd YourExptDir/scripts
# Edit edir in expt.in to point to YourExptDir.
./mkHMMs.sh ./expt.in &> ./mkHMMs.log
Evaluating the System
As in the previous system, an open loop cost-less grammar is used for
evaluating this system. The results are not very sensitive to the word
insertion penalty. Check the files *.res to see the results. The
template should give you a word error rate in the range of 2-4%.
cd YourExptDir
mkdir results
d=YourExptDir/setup
dict=$d/dict
mlist=$d/wrdlist
mmf=YourExptDir/CI-6-Mix/hmm4/MMF
odir=YourExptDir/results
wip=-60
YourTestDir/test.sh $dict $mlist $mmf $wip $odir
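To check how sensitive the results are to the word insertion penalty, the evaluation can be repeated over a few values of wip. The loop below is a sketch only: it assumes test.sh takes its arguments in the order used above, the penalty values and the per-penalty output directories are arbitrary choices, and it prints the commands rather than running them.

```shell
#!/bin/sh
# Dry run: print one test.sh command per word insertion penalty.
# YourExptDir and YourTestDir are placeholders, as in the commands above.
d=YourExptDir/setup
dict=$d/dict
mlist=$d/wrdlist
mmf=YourExptDir/CI-6-Mix/hmm4/MMF

sweep_cmds() {
    for wip in "$@"; do
        odir=YourExptDir/results/wip$wip      # hypothetical per-penalty directory
        echo "mkdir -p $odir && YourTestDir/test.sh $dict $mlist $mmf $wip $odir"
    done
}

cmds=$(sweep_cmds -80 -60 -40)                # arbitrary example penalties
echo "$cmds"
```

Once the placeholders are replaced with real paths, the printed lines can be run one by one and the resulting *.res files compared.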