| Workshop 2004 | Monday, November 23, 2009 |
Several WER metrics are relevant:
The only reason we haven't tested more observations so far is that training & testing an SVM for a very large observation vector takes a very long time.
A hierarchical feature selection approach was suggested: first test features using discriminant analysis and other low-computational complexity methods, then test the winners using SVMs. We agreed to make this one of the preliminary experiments to focus on between now and May.
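The two-stage idea can be sketched as follows. This is only an illustration under stated assumptions: the Fisher discriminant ratio stands in for "discriminant analysis and other low-computational complexity methods," the function names are hypothetical, and `expensive_score` is a placeholder for whatever SVM-based evaluation is eventually used (e.g., held-out accuracy of an SVM trained on that feature).

```python
from statistics import mean, pvariance

def fisher_ratio(pos, neg):
    """Squared gap between class means over summed class variances."""
    gap = (mean(pos) - mean(neg)) ** 2
    spread = pvariance(pos) + pvariance(neg)
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread

def screen_features(X, y, keep):
    """Stage 1 (cheap): rank each column of X by Fisher ratio, keep the top `keep`."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((fisher_ratio(pos, neg), j))
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:keep]]

def select_features(X, y, keep, expensive_score):
    """Stage 2 (costly): run the expensive evaluator only on stage-1 winners.

    `expensive_score` is a hypothetical callable mapping a feature index to a
    score, standing in for an SVM train/test cycle.
    """
    winners = screen_features(X, y, keep)
    return sorted(winners, key=expensive_score, reverse=True)

# Toy data, invented for illustration: 4 tokens, 3 candidate observations.
X = [[0.1, 5.0, 1.0], [0.2, 1.0, 1.1], [0.9, 4.0, 0.9], [1.0, 2.0, 1.0]]
y = [0, 0, 1, 1]
winners = screen_features(X, y, keep=2)
```

The design point is that the SVM cost is paid only `keep` times, rather than once per candidate observation.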
In particular, should we focus on duration features? It was suggested that accurate landmark detectors could reduce WER simply by more accurately specifying the segment in time over which, e.g., a vowel classifier should be applied.
Functional load could be measured using the oracle paradigm that Karen used for her HLT/NAACL04 paper: create a word transcription given the phoneme transcriptions available in ICSI-Switchboard. By deleting nodes in the DBN corresponding to each distinctive feature, and measuring the effect of each, it should be possible to estimate its importance for lexical discrimination. Karen expressed hesitation about using ICSI transcriptions, because they are segment-based with some annotation, rather than fully distinctive feature based. It was agreed that Amit and Sarah will generate distinctive feature transcriptions of ICSI-Switchboard using the current generation of SVMs.
Mark pointed out that, in theory, it should never be possible to eliminate a distinctive feature with zero increase in word error, because every distinctive feature by definition is the minimal distinction between at least one pair of words.
Partha's student has defined informational load of a distinctive feature in terms of lexical confusions caused by its absence. From this, he has computed rankings of several distinctive features. Obstruent voicing seems to have low information load in English; vowel features have various information loads, and manner and place tend to have high information load.
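As a rough illustration of measuring load through lexical confusions, the sketch below deletes one distinctive feature at a time and counts the word pairs that become indistinguishable. The phone table, the five-word lexicon, and the function name are all invented for this example; the counts say nothing about English or about the rankings reported above.

```python
from itertools import combinations

# Hypothetical toy feature table (not a real phonological analysis).
PHONES = {
    "p": {"voiced": 0, "labial": 1, "continuant": 0, "syllabic": 0},
    "b": {"voiced": 1, "labial": 1, "continuant": 0, "syllabic": 0},
    "t": {"voiced": 0, "labial": 0, "continuant": 0, "syllabic": 0},
    "d": {"voiced": 1, "labial": 0, "continuant": 0, "syllabic": 0},
    "s": {"voiced": 0, "labial": 0, "continuant": 1, "syllabic": 0},
    "a": {"voiced": 1, "labial": 0, "continuant": 1, "syllabic": 1},
}

# Hypothetical toy lexicon: word -> phone sequence.
LEXICON = {
    "pat": ("p", "a", "t"),
    "bat": ("b", "a", "t"),
    "pad": ("p", "a", "d"),
    "tad": ("t", "a", "d"),
    "sat": ("s", "a", "t"),
}

def confusions_without(feature, lexicon, phones):
    """Count word pairs that collapse when `feature` is removed from every phone."""
    def project(word):
        return tuple(
            tuple(sorted((f, v) for f, v in phones[p].items() if f != feature))
            for p in word
        )
    return sum(
        1
        for w1, w2 in combinations(lexicon, 2)
        if project(lexicon[w1]) == project(lexicon[w2])
    )

loads = {f: confusions_without(f, LEXICON, PHONES) for f in PHONES["p"]}
```

A feature scores zero here only because this toy lexicon happens to contain no word pair that depends on it, which is consistent with Mark's point that a full lexicon should never yield a truly free deletion.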
Steve described error analysis of Switchboard recognizer outputs (available on his web page), from which he concluded that most word recognition errors seem to be caused by errors in recognition of the place or manner of a syllable onset.
Preliminary SVM experiments in landmark detection and classification have used features primarily derived from Keyser & Stevens, "Feature Geometry and the Vocal Tract," Phonology 11:207-236, 1994. In particular, the features "continuant, sonorant, syllabic" are useful because they are sufficient to define the phoneme boundaries and phoneme centers where landmarks occur.
Preliminary experiments in pronunciation modeling have used gestural features primarily derived from Browman & Goldstein, "Articulatory Phonology: An Overview," Phonetica 49:155-180, 1992, because articulatory phonology can represent many reduction and coarticulation phenomena as specific examples of gesture mis-timing, a phenomenon that is particularly easy to represent using a DBN.
Concern was expressed that the mapping from Keyser-Stevens features to Browman-Goldstein features may not be trivial. Karen will look more closely at this problem.
It has been proposed that we should select training examples discriminatively, i.e., by choosing examples from Switchboard according to the training lattices. SVMs (and many types of classifiers) work best if trained using examples that are hard to classify (close to the interclass boundary). Therefore we expect improved rescoring performance if the SVMs are trained using waveform segments where the first-pass recognizer made a mistake. The difficult problem is: given word lattices, how do we find the time alignment of the problem phoneme? Two methods were proposed. First, the SRI first-pass decoder has the capability to output phoneme time alignments; we can ask Andreas and Kemal whether it would be possible for SRI to generate recognition lattices with phoneme time alignments for some small portion of the training data (consensus was that about fifteen hours of data should contain enough examples to train all of the SVMs, and that even one hour of data would contain enough examples to train many of the landmark detectors). Second, we could apply the existing landmark detectors to find the phoneme boundaries, then extract training examples, and use the extracted examples to retrain the SVMs.
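Once phoneme time alignments are available, by either method, picking out the hard examples is simple. The sketch below assumes a reference alignment and a first-pass hypothesis alignment, each a list of `(start, end, label)` tuples in seconds. This data layout, the function names, and the rule of matching each reference segment to the hypothesis segment with the greatest time overlap are assumptions made for illustration, not something agreed at the meeting.

```python
def overlap(a, b):
    """Duration (seconds) shared by two (start, end, label) segments."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def hard_examples(reference, hypothesis):
    """Return reference segments that the first pass mislabeled or missed."""
    selected = []
    for ref in reference:
        best = max(hypothesis, key=lambda hyp: overlap(ref, hyp), default=None)
        if best is None or overlap(ref, best) == 0.0 or best[2] != ref[2]:
            selected.append(ref)
    return selected

# Invented alignments for the word "cat"; the first pass confuses the vowel.
reference = [(0.00, 0.10, "k"), (0.10, 0.25, "ae"), (0.25, 0.32, "t")]
hypothesis = [(0.00, 0.11, "k"), (0.11, 0.24, "eh"), (0.24, 0.32, "t")]
to_retrain_on = hard_examples(reference, hypothesis)
```

Only the waveform between 0.10 s and 0.25 s would be extracted as an SVM training token in this example.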
Scores could also be integrated by simply comparing SVM results to the
canonical pronunciation (or pronunciations) in a dictionary. This
method is less flexible but "safer."
It would be possible to apply either the DBN or a neural-network
pronunciation model across word boundaries. In either case, complexity
would increase because the lattice allows many possible edges to
follow each candidate edge. Ken pointed out that the increase in
complexity might be manageable because you would only have to rescore
the edge once for each possible following phoneme, not once for each
possible following word.
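Ken's observation can be illustrated by grouping the following edges by their first phoneme, so that one rescoring pass is shared by every following word with the same initial phoneme. The pronunciations and names below are hypothetical, and each word is assumed to have a single dictionary pronunciation.

```python
def following_contexts(following_words, lexicon):
    """Group following words by the first phoneme of their pronunciation."""
    groups = {}
    for word in following_words:
        groups.setdefault(lexicon[word][0], []).append(word)
    return groups

# Hypothetical pronunciations, for illustration only.
lexicon = {
    "the": ("dh", "ah"),
    "that": ("dh", "ae", "t"),
    "cat": ("k", "ae", "t"),
    "can": ("k", "ae", "n"),
    "a": ("ah",),
}
contexts = following_contexts(["the", "that", "cat", "can", "a"], lexicon)
rescoring_passes = len(contexts)  # one pass per distinct following phoneme
```

Here five following words require three rescoring passes; in a real lattice the saving is bounded by the size of the phoneme inventory rather than by the vocabulary.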
Pinched lattices could be rescored in the same way. The disadvantage
of pinching is that original edge start and end times are gone. The
advantage is that we wouldn't have to measure all distinctive
features: the only distinctive features measured would be those whose
values differentiate any two edges in an edgeset.
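The selection of features for a pinched edgeset could look roughly like this. As a simplifying assumption, each edge is reduced to a single flat feature specification (a real edge would expand into a sequence of landmarks), and the feature values and function name are invented.

```python
def features_to_measure(edgeset):
    """Return the features whose values differ between at least two edges.

    `edgeset` maps an edge label to a dict of feature -> value. A feature
    missing from an edge is treated as the value None.
    """
    all_features = set().union(*(spec.keys() for spec in edgeset.values()))
    return sorted(
        f
        for f in all_features
        if len({spec.get(f) for spec in edgeset.values()}) > 1
    )

# Hypothetical edgeset: three competing words differing in the onset only.
edgeset = {
    "bat": {"voiced": 1, "labial": 1, "continuant": 0},
    "pat": {"voiced": 0, "labial": 1, "continuant": 0},
    "vat": {"voiced": 1, "labial": 1, "continuant": 1},
}
needed = features_to_measure(edgeset)
```

In this example "labial" never needs to be measured, because all three competitors agree on it.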
Pinched lattices could also be rescored by using the SVMs (possibly
integrated with the DBN) to choose the best edge from each
edgeset. This method could completely ignore the $a and $l variables
computed by the first-pass recognizer.
Group consensus was that we should have some method for computing
posterior probabilities of each edge (i.e., stream weights). Since
stream weights must be computed in any case, it may be possible to run
both pinched and unpinched lattices, and see which works better.

METHODOLOGY
HOW SHOULD WE INTEGRATE THE SVM OUTPUTS?
Scores produced by the SVM could be integrated, in order to compute
the total score for each edge in the word lattice, using a DBN
representing articulatory phonology. There was consensus that the DBN
may be the most flexible and well-founded approach to this problem,
but that there are details of the SVM-DBN integration that still need
to be worked out in preliminary experiments.

SHOULD WE PINCH THE LATTICES OR NOT?
Unpinched lattices could be rescored by using SVMs to find a
landmark-based log-likelihood score "$d" for each edge. For example,
dictionary expansion of the edge could propose the landmarks that
should be present, and the distinctive features that should be present
at those landmarks. SVMs could be applied to find the proposed
landmarks, and compute the posterior probability that each distinctive
feature is present. The DBN could be used to compute the log
likelihood, $d, of the observed SVM outputs given the canonical word
transcription. The total edge score is then computed as
$b1*$a+$b2*$l+$b3*$d, where $a and $l are the acoustic and
language-model scores recorded for this edge by the first-pass
recognizer, and $b1, $b2, and $b3 are stream weights estimated by
minimizing error on a development test set.
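A minimal sketch of the score combination and weight estimation follows. The combination itself is the formula above; everything else is an assumption made for illustration: competing hypotheses are represented as small dicts holding `a`, `l`, and `d`, error is counted as the number of development items where the top-scoring hypothesis is wrong, and the weights are found by an exhaustive grid search, which is only one of several ways the minimization could be done.

```python
from itertools import product

def edge_score(a, l, d, b1, b2, b3):
    """Total edge score: b1*a + b2*l + b3*d."""
    return b1 * a + b2 * l + b3 * d

def pick_edge(candidates, weights):
    """Return the word of the highest-scoring candidate under `weights`."""
    best = max(
        candidates,
        key=lambda e: edge_score(e["a"], e["l"], e["d"], *weights),
    )
    return best["word"]

def tune_weights(dev_items, grid):
    """Grid search for (b1, b2, b3) minimizing errors on the development items.

    `dev_items` is a list of (candidates, correct_word) pairs.
    """
    best_weights, best_errors = None, None
    for weights in product(grid, repeat=3):
        errors = sum(pick_edge(c, weights) != ref for c, ref in dev_items)
        if best_errors is None or errors < best_errors:
            best_weights, best_errors = weights, errors
    return best_weights, best_errors

# Invented log-domain scores: the first pass prefers "pat"; the landmark
# score d prefers "bat".
dev_items = [
    (
        [
            {"word": "pat", "a": -9, "l": -2, "d": -5},
            {"word": "bat", "a": -10, "l": -2, "d": -1},
        ],
        "bat",
    )
]
weights, errors = tune_weights(dev_items, grid=[0.5, 1.0, 2.0])
```

With b3 set to zero the first-pass ranking is recovered unchanged, so the tuned weights can never do worse on the development set than ignoring $d, provided zero is included in the grid.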