The correct word was found in the lattice in only 8% of substitution errors. Thus the maximum WER improvement attainable by correcting substitution errors is only 1.1%.
In decreasing order of frequency, substitution errors were caused by:
Results above suggest that research should focus on detection of manner-change landmarks, especially in syllable-final position, for detecting word substitutions, insertions, and deletions.
Purpose of closing the loop early and often:
Rescoring experiments will determine which distinctive features and
which acoustic features seem to benefit WER. Inputs and outputs of the
SVMs, SSM, and DBN will be adapted to rescoring results in order to
maximize our chance of success.
It was proposed that Kemal, Ken and Vidya could spend some time running
"sanity checks" to make sure that the amoeba search is finding
reasonable WER minima. They could also test other optimization
techniques, but doubt was expressed that 6 weeks would be enough to
develop a new optimization technique for log-linear lattice rescoring.
With an N-best list, we can specify exactly which distinctive features
distinguish the first-pass transcription from each alternative
sentence. It would therefore be possible to test an alternative
rescoring paradigm, consisting of N-1 binary comparisons between the
first-pass transcription and, in turn, each of its N-1
alternatives. In the alternative paradigm, binary discrimination
between the first-pass transcription and each alternate would be
computed by scoring only a handful of the most salient or reliable
landmarks. At the 4/16 meeting, we discussed a top-down rescoring
algorithm using pinched lattices: each word in the first-pass
transcription would be compared against alternative words listed in
the pinched lattice. This method suffers because pinched lattices lose
precise time alignment, and precise time alignment is critical for
most of the confusions observed in practice (e.g., "movie"
vs. "movies"). I believe that no solution to this problem has been
proposed.
An unpinched lattice does not define alternative transcriptions. In
order to use top-down rescoring, it is necessary either to pinch the
lattice, or to enumerate an N-best list from the lattice.
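The N-1 binary comparisons described above can be sketched as
follows. This is only an illustration under stated assumptions: it
works at the word level rather than on distinctive features, the
function names (differing_spans, binary_comparisons) and the toy
N-best list are hypothetical, and the standard-library difflib
alignment stands in for whatever alignment the real system would
use. A landmark scorer would then be applied only to the spans that
differ.

```python
import difflib


def differing_spans(first_pass, alternative):
    """Return (first-pass words, alternative words) for every region
    where the two transcriptions disagree."""
    a, b = first_pass.split(), alternative.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def binary_comparisons(nbest):
    """Pair the first-pass transcription (entry 0) with each of the
    N-1 alternatives and list the words that distinguish them."""
    first, alternatives = nbest[0], nbest[1:]
    return [(alt, differing_spans(first, alt)) for alt in alternatives]


# Hypothetical 3-best list; entry 0 is the first-pass transcription.
nbest = ["i saw the movie", "i saw the movies", "i saw a movie"]
comparisons = binary_comparisons(nbest)
for alt, spans in comparisons:
    print(alt, spans)
```

Each comparison isolates a small region (for example "movie"
vs. "movies"), which is where a handful of salient landmarks would
be scored.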
The DBN needs w*p(acoustics|class), for some arbitrary constant w.
Morgan and Bourlard (Signal Processing Magazine, 1995, 12(5):25-42)
showed that the version of p(class|acoustics) computed by an MLP is
NOT the same one that's needed by the recognizer. Instead, the MLP
computes p(class|acoustics) under the assumption that class priors
during training equal class priors during testing. Usually it is not
wise to use the true priors to train an MLP or SVM, because if the
true prior probability of class=i is very small, the minimum error
classifier might simply choose NEVER to say class=i.
Instead, train using one set of priors pi(class=i), and test using a
different set of priors rho(class=i). For example, pi(class=i) is
often chosen so that all classes have an equal number of training
tokens. Then the output of the MLP or SVM is an estimate of the
following probability:
(**) pi(class=i|acoustics) = p(acoustics|class=i) * pi(class=i) /
sum_j (p(acoustics|class=j)*pi(class=j)).
During testing, the SVM or MLP computes pi(class=i|acoustics). We then
solve the N linear equations (**) in order to find the N unknowns
p(acoustics|class=i). Note that the equations are linear in the
unknowns, because we can multiply both sides of (**) by the
denominator of the RHS. The matrix is singular, however, because
sum_j(pi(class=j|acoustics))=1. The null space of the matrix is as
follows: if p(acoustics|class) is a solution of (**), then so is
w*p(acoustics|class), for any real w.
In particular, if pi(class=i) is uniform (equal numbers of all classes
used during training), then the solution to equations (**) is
pi(class=i|acoustics) = w*p(acoustics|class=i) for some w. In other
words, if we use equal numbers of + and - tokens to train the MLP (or
alternatively, train the MLP using a weighted squared error metric,
where the weight is one over the number of training tokens in each
class), then the estimates pi(class=i|acoustics) can be used in place
of p(acoustics|class) by the DBN.
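A small numerical check of the argument above, as a sketch rather
than the workshop's actual code. The function names and the toy
likelihood and prior values are hypothetical. Dividing each posterior
by its training prior recovers the likelihoods up to one shared
constant w, and with uniform training priors the posteriors
themselves are already proportional to the likelihoods.

```python
def posteriors_from_likelihoods(likelihoods, priors):
    """Equation (**): pi(class=i|acoustics) from p(acoustics|class=i)
    and the training priors pi(class=i)."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]


def scaled_likelihoods(posteriors, train_priors):
    """Divide out the training priors. The result equals
    w * p(acoustics|class=i), with the same w for every class."""
    if len(posteriors) != len(train_priors):
        raise ValueError("need one training prior per class")
    return [q / p for q, p in zip(posteriors, train_priors)]


# Hypothetical two-class example with likelihood ratio 0.06 / 0.02 = 3.
likelihoods = [0.02, 0.06]
uniform = [0.5, 0.5]   # equal numbers of + and - training tokens
skewed = [0.9, 0.1]    # unbalanced training set

post_uniform = posteriors_from_likelihoods(likelihoods, uniform)
post_skewed = posteriors_from_likelihoods(likelihoods, skewed)

scaled_u = scaled_likelihoods(post_uniform, uniform)
scaled_s = scaled_likelihoods(post_skewed, skewed)

# Both ratios recover the likelihood ratio, whatever the training priors.
ratio_u = scaled_u[1] / scaled_u[0]
ratio_s = scaled_s[1] / scaled_s[0]
# With uniform priors the raw posteriors already have that ratio.
ratio_raw_uniform = post_uniform[1] / post_uniform[0]
print(ratio_u, ratio_s, ratio_raw_uniform)
```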
Development Schedule
We will close the loop early and often. Preliminary SSM and DBN scores
(p(word|algorithm)) will be available for lattices covering the
ICSI-Switchboard data by the beginning of the workshop. If RT02 and
RT03 data are available in time, preliminary scores will also be
available for those databases. After the beginning of the workshop, we
will try to refine scores on a weekly basis (by Monday morning of
every week?).
Lattices or N-best lists?
Until 5/14/04, we assumed that we would rescore lattices this
summer. A number of points raised on 5/14/04 suggested that we may be
better off rescoring N-best lists. The question has not yet been
resolved.
Points in support of N-best lists include:
Multiple Scores
All breakout groups proposed generating multiple system variants. Most
likely there will be a few dozen different system variants; each of
these system variants could produce a different p(word|system)
score. Lattice or N-best rescoring experiments, with development test
data, will determine how many of these scores to combine, and will
determine log-linear weights for the retained scores. Previous
experience suggests that amoeba search can compute good log-linear
combination weights for perhaps 2-10 auxiliary word scores.
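The log-linear combination described above can be illustrated with a
minimal sketch over N-best lists. Everything here is an assumption
made for illustration: the function names, the toy development set,
and the log scores are hypothetical, and a coarse grid search stands
in for the amoeba (Nelder-Mead simplex) search. The first-pass weight
is fixed at 1, and only the auxiliary weights are tuned.

```python
import itertools


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]


def rescore(nbest, weights):
    """Pick the hypothesis with the highest weighted sum of log scores."""
    def combined(entry):
        return sum(w * s for w, s in zip(weights, entry[1]))
    return max(nbest, key=combined)[0]


def word_errors(dev_set, weights):
    """Total word errors on the development set after rescoring."""
    return sum(edit_distance(ref.split(), rescore(nbest, weights).split())
               for ref, nbest in dev_set)


def grid_search(dev_set, n_scores, grid=(0.0, 0.5, 1.0, 2.0)):
    """Stand-in for amoeba search: try every auxiliary weight setting."""
    best = None
    for aux in itertools.product(grid, repeat=n_scores - 1):
        weights = (1.0,) + aux
        errors = word_errors(dev_set, weights)
        if best is None or errors < best[1]:
            best = (weights, errors)
    return best


# Hypothetical dev set: (reference, [(hypothesis, (first-pass, auxiliary))]).
dev_set = [
    ("the movies were good",
     [("the movie were good", (-10.0, -6.0)),
      ("the movies were good", (-10.5, -3.0))]),
    ("i saw it",
     [("i saw it", (-5.0, -2.0)),
      ("i saw at", (-5.5, -2.5))]),
]

baseline_errors = word_errors(dev_set, (1.0, 0.0))
best_weights, best_errors = grid_search(dev_set, n_scores=2)
print(baseline_errors, best_weights, best_errors)
```

The grid grows exponentially with the number of auxiliary scores,
which is one reason a simplex search is preferred once there are
more than a few of them.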
Word Context
Preliminary lattice analysis indicates that landmarks near the end of
the word are critical for lattice or N-best rescoring. Therefore
rescoring depends on good information about (1) the time at which a
word ends, and (2) the phonemes that follow a word. The lexicons breakout
group proposed that SSMs and DBNs can generate multiple scores, using
multiple context definitions. The following context definition was
specifically proposed.
Algorithm details
Points above are issues that concern all subsystems. The following two
points are somewhat more picayune algorithmic details, but are
probably important for system performance.
Definition of acoustic scores
An MLP computes p(class|acoustics). An SVM computes
p(class|discriminant), where the discriminant is a scalar summary of
the acoustics. The discriminant dimension is computed in a way that
minimizes generalization error, so p(class|discriminant) can be a
good approximation of p(class|acoustics), even when the dimension of
the acoustic observation vector is 10,000 or more, as in many of our
experiments.
Mapping from DBN distinctive features to SVM distinctive features
The last hour of the day was spent on an in-depth analysis of the
mapping between DBN distinctive features and SVM/landmark distinctive
features. A deterministic many-to-one mapping between DBN and SVM
features seems to be pretty easy, except for the following. In order
to distinguish between silence, /y/, and /i/ (or between silence, /n/,
and /en/), the most elegant solution will give the DBN a new hidden
variable called something like "prosodic tier" or "sonorancy rank" or
"slot alignment." The new hidden variable might take four settings:
{silence, consonant, unstressed nucleus, and stressed nucleus}.
Workshop 2004
Saturday, November 7, 2009
The Center for Language and Speech Processing
The Johns Hopkins University
3400 North Charles Street, Barton Hall
Baltimore, MD 21218
Telephone: (410) 516-4237
Fax: (410) 516-5050
E-mail: clsp@clsp.jhu.edu