Workshop 2004 Monday, November 23, 2009

WS04: Landmark-Based Speech Recognition

Notes and Slides: Planning Meeting Number 1, April 16, 2004

  1. Introductory Lectures (links to slides)
  2. Technological Objectives of the Workshop
  3. Scientific Objectives of the Workshop
  4. Methodology
  5. Expertise and Mentoring of Participants
  6. Preliminary Experiments: Foci until May

INTRODUCTORY LECTURES

TECHNOLOGICAL OBJECTIVE OF THE WORKSHOP

We have been successful, so far, at obtaining better binary distinctive feature classification accuracy using the landmark-based method than using any previously proposed method. The technological objective, for this workshop, is to use the landmark-based method to get better speech recognition accuracy. Specifically, the goal is to use landmark detectors and classifiers to rescore a word lattice in such a way that we remove more errors than we introduce into the MAP path.

Several WER metrics are relevant:

SCIENTIFIC OBJECTIVES OF THE WORKSHOP

NOVEL ACOUSTIC OBSERVATIONS

All agreed that the ability to use many different types of acoustic observations is one of the compelling strengths of the landmark-based method. Sarah has tested multi-window observations and MFCC+formant classifiers. MFCCs + formants tend to dramatically outperform MFCCs alone; speculatively, this is because the classifier is able to learn comparisons like Stevens' famous definition of strident, "energy of the fricative in the F4/F5 region is higher than energy of the neighboring vowel in the same region," where energy information comes from the MFCC, and reference formant frequencies come from the formant tracker.
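
As an illustration only, the comparison in Stevens' definition can be written out directly. This is a minimal sketch, assuming power spectra have already been computed for the fricative and for the neighboring vowel, and that F4 and F5 estimates come from a formant tracker. The function names, the margin_hz parameter, and the numbers are hypothetical and are not taken from any classifier tested so far.

```python
import math

def band_energy_db(power_spectrum, bin_hz, lo_hz, hi_hz):
    """Sum the power in the bins between lo_hz and hi_hz, and convert to dB."""
    total = sum(p for k, p in enumerate(power_spectrum)
                if lo_hz <= k * bin_hz <= hi_hz)
    return 10.0 * math.log10(total) if total > 0 else float("-inf")

def strident_cue_db(fric_spec, vowel_spec, bin_hz, f4_hz, f5_hz, margin_hz=200.0):
    """Fricative energy minus vowel energy (dB) in the F4/F5 region.

    A positive value is consistent with [+strident] under the quoted definition.
    """
    lo, hi = f4_hz - margin_hz, f5_hz + margin_hz
    return (band_energy_db(fric_spec, bin_hz, lo, hi)
            - band_energy_db(vowel_spec, bin_hz, lo, hi))

# Toy spectra with 500 Hz bins covering 0 to 4500 Hz (made-up values).
fricative = [1.0] * 7 + [10.0, 10.0, 10.0]
vowel = [10.0] * 7 + [1.0, 1.0, 1.0]
cue = strident_cue_db(fricative, vowel, 500.0, 3500.0, 4500.0, margin_hz=0.0)
```

A classifier given MFCCs plus formants would have to learn such a comparison implicitly; the sketch only shows what the comparison looks like when computed by hand.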

The only reason we haven't tested more observations so far is that training & testing an SVM for a very large observation vector takes a very long time.

A hierarchical feature selection approach was suggested: first test features using discriminant analysis and other low-computational complexity methods, then test the winners using SVMs. We agreed to make this one of the preliminary experiments to focus on between now and May.

In particular, should we focus on duration features? It was suggested that accurate landmark detectors could reduce WER simply by more accurately specifying the segment in time over which, e.g., a vowel classifier should be applied.

WHAT'S IMPORTANT TO ENCODE IN THE PRONUNCIATION MODEL?

Both the SVMs and the DBN are computationally expensive. One way to save computational complexity would be to focus on distinctive features with the highest functional load.

Functional load could be measured using the oracle paradigm that Karen used for her HLT/NAACL04 paper: create a word transcription given the phoneme transcriptions available in ICSI-Switchboard. By deleting nodes in the DBN corresponding to each distinctive feature, and measuring the effect of each, it should be possible to estimate its importance for lexical discrimination. Karen expressed hesitation about using ICSI transcriptions, because they are segment-based with some annotation, rather than fully distinctive feature based. It was agreed that Amit and Sarah will generate distinctive feature transcriptions of ICSI-Switchboard using the current generation of SVMs.

Mark pointed out that, in theory, it should never be possible to eliminate a distinctive feature with zero increase in word error, because every distinctive feature by definition is the minimal distinction between at least one pair of words.

Partha's student has defined informational load of a distinctive feature in terms of lexical confusions caused by its absence. From this, he has computed rankings of several distinctive features. Obstruent voicing seems to have low information load in English; vowel features have various information loads, and manner and place tend to have high information load.
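
One way to picture "lexical confusions caused by its absence" is to ignore one feature at a time and count the word pairs that collapse together. The following is a toy sketch under that assumption; the feature table and lexicon are invented for illustration, and its counts say nothing about the rankings reported above.

```python
from itertools import combinations

# Hypothetical feature table: phoneme -> {feature name: value}.
FEATURES = {
    "p": {"voiced": 0, "place": "labial", "manner": "stop"},
    "b": {"voiced": 1, "place": "labial", "manner": "stop"},
    "t": {"voiced": 0, "place": "alveolar", "manner": "stop"},
    "s": {"voiced": 0, "place": "alveolar", "manner": "fricative"},
    "a": {"voiced": 1, "place": "low", "manner": "vowel"},
    "i": {"voiced": 1, "place": "high", "manner": "vowel"},
}

# Hypothetical lexicon: word -> phoneme sequence.
LEXICON = {
    "pat": ["p", "a", "t"],
    "bat": ["b", "a", "t"],
    "tat": ["t", "a", "t"],
    "sat": ["s", "a", "t"],
    "pit": ["p", "i", "t"],
}

def encode(phones, dropped):
    """Represent a word by its feature bundles, leaving out one feature."""
    return tuple(
        tuple(sorted((f, v) for f, v in FEATURES[p].items() if f != dropped))
        for p in phones
    )

def confusions(dropped):
    """Count word pairs that become identical once `dropped` is ignored."""
    return sum(
        1 for w1, w2 in combinations(LEXICON, 2)
        if encode(LEXICON[w1], dropped) == encode(LEXICON[w2], dropped)
    )

# Features that cause more confusions when removed carry a higher load.
load = {f: confusions(f) for f in ("voiced", "place", "manner")}
```

With a real pronunciation dictionary the same count could be weighted by word frequency, which this sketch does not attempt.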

Steve described error analysis of Switchboard recognizer outputs (available on his web page), from which he concluded that most word recognition errors seem to be caused by errors in recognition of the place or manner of a syllable onset.

WHAT DISTINCTIVE FEATURES MAKE SENSE FOR LANDMARK DETECTION AND CLASSIFICATION? FOR THE PRONUNCIATION MODEL?

Consensus was that distinctive features should be chosen based on scientific value. Thus they should be chosen from the phonology/phonetics literature. If we are to focus on any subset of the features proposed in the literature, we should focus on a subset with high informational load. For example, previous experiments suggest that place and manner of syllable onsets are the most important features for correcting errors in a word lattice.

Preliminary SVM experiments in landmark detection and classification have used features primarily derived from Keyser & Stevens, "Feature Geometry and the Vocal Tract," Phonology 11:207-236, 1994. In particular, the features "continuant, sonorant, syllabic" are useful because they are sufficient to define the phoneme boundaries and phoneme centers where landmarks occur.

Preliminary experiments in pronunciation modeling have used gestural features primarily derived from Browman & Goldstein, "Articulatory Phonology: An Overview," Phonetica 49:155-180, 1992, because the articulatory phonology is able to represent many reduction and coarticulation phenomena as specific examples of gesture mis-timing, a phenomenon particularly easy to represent using a DBN.

Concern was expressed that the mapping from Keyser-Stevens features to Browman-Goldstein features may not be trivial. Karen will look more closely at this problem.

HOW DO WE SELECT TRAINING EXAMPLES FOR THE SVMS?

Our current method is based on the following assumption: each SVM should represent the acoustic correlates, over a telephone channel, of the canonical implementation of the named distinctive feature. The SVM should report the value of the distinctive feature as it was actually produced by the talker, and the difference between the actual value and the canonical value should be modeled by a pronunciation model. Based on this assumption, we have been extracting training examples from data with fine phonetic transcriptions, but recorded over a telephone channel: NTIMIT and the ICSI Switchboard corpus. NTIMIT is useful because stop burst times are transcribed. It should be noted, however, that NTIMIT transcriptions can only represent 64 different types of segments --- e.g., stops reduced to fricatives, or reduced to glides, are often transcribed with the symbol for the stop consonant, because no glide or fricative with the same place of articulation exists in English. In ICSI-Switchboard, some of these distinctions are annotated. Steve says that if a phoneme was produced non-canonically in ICSI Switchboard, i.e., if one or more distinctive features were not produced according to the canonical form of the phoneme, the one most saliently modified distinctive feature was usually noted in the transcription, and others usually were not.

It has been proposed that we should select training examples discriminatively, i.e., by choosing examples from Switchboard according to the training lattices. SVMs (and many types of classifiers) work best if trained using examples that are hard to classify (close to the interclass boundary). Therefore we expect improved rescoring performance if the SVMs are trained using waveform segments where the first-pass recognizer made a mistake. The difficult problem is: given word lattices, how do we find the time alignment of the problem phoneme? Two methods were proposed. First, the SRI first-pass decoder has the capability to output phoneme time alignments; we can ask Andreas and Kemal whether it would be possible for SRI to generate recognition lattices with phoneme time alignments for some small portion of the training data (consensus was that about fifteen hours of data should contain enough examples to train all of the SVMs, and that even one hour of data would contain enough examples to train many of the landmark detectors). Second, we could apply the existing landmark detectors to find the phoneme boundaries, then extract training examples, and use the extracted examples to retrain the SVMs.
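
Once phoneme time alignments are available by either method, the selection step itself is simple. The following is a minimal sketch, assuming each token already pairs a reference phoneme with the phoneme on the first-pass MAP path over the same time span; the Token type and the pad parameter are hypothetical.

```python
from typing import NamedTuple

class Token(NamedTuple):
    start: float  # segment start time, in seconds
    end: float    # segment end time, in seconds
    ref: str      # phoneme in the correct transcription
    hyp: str      # phoneme on the first-pass MAP path

def hard_examples(tokens, pad=0.05):
    """Return (start, end, ref) windows for tokens the first pass got wrong.

    Each window is widened by `pad` seconds on both sides, so that landmarks
    at the segment edges fall inside the extracted waveform.
    """
    return [
        (max(0.0, t.start - pad), t.end + pad, t.ref)
        for t in tokens
        if t.ref != t.hyp
    ]

tokens = [
    Token(0.10, 0.20, "p", "p"),  # recognized correctly: skipped
    Token(0.20, 0.30, "t", "k"),  # place error: kept as a training example
]
selected = hard_examples(tokens, pad=0.0)
```

The hard part, as noted above, is producing the aligned tokens in the first place; this sketch takes that alignment as given.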

METHODOLOGY

HOW SHOULD WE INTEGRATE THE SVM OUTPUTS?

Scores produced by the SVM could be integrated, in order to compute the total score for each edge in the word lattice, using a DBN representing articulatory phonology. There was consensus that the DBN may be the most flexible and well-founded approach to this problem, but that there are details of the SVM-DBN integration that still need to be worked out in preliminary experiments.

Scores could also be integrated by simply comparing SVM results to the canonical pronunciation (or pronunciations) in a dictionary. This method is less flexible but "safer."

It would be possible to apply either the DBN or a neural-network pronunciation model across word boundaries. In either case, complexity would increase because the lattice allows many possible edges to follow each candidate edge. Ken pointed out that the increase in complexity might be manageable because you would only have to rescore the edge once for each possible following phoneme, not once for each possible following word.
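
Ken's point can be seen by counting contexts. This is a small sketch, assuming a hypothetical lexicon that maps each word to its phoneme sequence: the number of cross-word rescoring passes for an edge is the number of distinct initial phonemes among its successors, not the number of successor words.

```python
def following_phoneme_contexts(following_words, lexicon):
    """Return the distinct first phonemes of the words that may follow an edge."""
    return {lexicon[w][0] for w in following_words}

# Hypothetical lexicon and successor list for one lattice edge.
lexicon = {
    "bat": ["b", "ae", "t"],
    "bad": ["b", "ae", "d"],
    "ban": ["b", "ae", "n"],
    "pat": ["p", "ae", "t"],
}
successors = ["bat", "bad", "ban", "pat"]
contexts = following_phoneme_contexts(successors, lexicon)
# Four following words, but only two following-phoneme contexts to rescore.
```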

SHOULD WE PINCH THE LATTICES OR NOT?

Unpinched lattices could be rescored by using SVMs to find a landmark-based log-likelihood score "$d" for each edge. For example, dictionary expansion of the edge could propose the landmarks that should be present, and the distinctive features that should be present at those landmarks. SVMs could be applied to find the proposed landmarks, and compute the posterior probability that each distinctive feature is present. The DBN could be used to compute the log likelihood, $d, of the observed SVM outputs given the canonical word transcription. The total edge score is then computed as $b1*$a+$b2*$l+$b3*$d, where $a and $l are the acoustic and language-model scores recorded for this edge by the first-pass recognizer, and $b1, $b2, and $b3 are stream weights estimated by minimizing error on a development test set.
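
The edge score combination can be sketched in a few lines. The variable names follow the text ($a, $l, $d, $b1, $b2, $b3); the two competing edges and all numeric values are made up for illustration.

```python
def edge_score(a, l, d, b1, b2, b3):
    """Total edge score b1*a + b2*l + b3*d, with all scores in the log domain."""
    return b1 * a + b2 * l + b3 * d

# Two hypothetical competing edges with first-pass scores (a, l) and a
# landmark-based score d.
cat = {"word": "cat", "a": -100.0, "l": -10.0, "d": -5.0}
cap = {"word": "cap", "a": -98.0, "l": -10.0, "d": -20.0}

def total(edge, b1, b2, b3):
    return edge_score(edge["a"], edge["l"], edge["d"], b1, b2, b3)

# Without the landmark stream (b3 = 0) "cap" wins; with it, "cat" wins.
first_pass_winner = max((cat, cap), key=lambda e: total(e, 1.0, 8.0, 0.0))["word"]
rescored_winner = max((cat, cap), key=lambda e: total(e, 1.0, 8.0, 2.0))["word"]
```

In an unpinched lattice these totals would feed a best-path search rather than a direct comparison of two edges; the comparison here only shows how the third stream can change a decision.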

Pinched lattices could be rescored in the same way. The disadvantage of pinching is that original edge start and end times are gone. The advantage is that we wouldn't have to measure all distinctive features: the only distinctive features measured would be those whose values differentiate any two edges in an edgeset.

Pinched lattices could also be rescored by using the SVMs (possibly integrated with the DBN) to choose the best edge from each edgeset. This method could completely ignore the $a and $l variables computed by the first-pass recognizer.

Group consensus was that we should have some method for computing posterior probabilities of each edge (i.e., stream weights). Since stream weights must be computed in any case, it may be possible to run both pinched and unpinched lattices, and see which works better.
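
Stream weights could be estimated in many ways; the simplest is an exhaustive grid search on development data. The following is a minimal sketch, assuming pinched edgesets with a known correct word for each; the data, the grid values, and the function names are hypothetical.

```python
from itertools import product

def pick(edgeset, w):
    """Choose the word whose weighted score w[0]*a + w[1]*l + w[2]*d is highest."""
    best = max(edgeset, key=lambda e: w[0] * e["a"] + w[1] * e["l"] + w[2] * e["d"])
    return best["word"]

def errors(dev, w):
    """Count the edgesets in which the chosen word differs from the reference."""
    return sum(pick(edgeset, w) != ref for edgeset, ref in dev)

def grid_search(dev, grid):
    """Return the (b1, b2, b3) triple on the grid with the fewest dev errors."""
    return min(product(grid, repeat=3), key=lambda w: errors(dev, w))

# Toy development set: (edgeset, correct word) pairs with made-up scores.
dev = [
    ([{"word": "cat", "a": -10.0, "l": -2.0, "d": -1.0},
      {"word": "cap", "a": -9.0, "l": -2.0, "d": -6.0}], "cat"),
    ([{"word": "two", "a": -5.0, "l": -1.0, "d": -2.0},
      {"word": "do", "a": -6.0, "l": -3.0, "d": -1.5}], "two"),
]
best_weights = grid_search(dev, (0.5, 1.0, 2.0))
```

Running the same search separately on pinched and unpinched lattices would give one way to compare the two setups.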

EXPERTISE AND MENTORING

This discussion took place at the same time as the discussion of preliminary experiments, but for the purposes of these notes it may be more conveniently understood separately.

Vidya and Ken will develop expertise in lattice rescoring and confusion networks. Katrin will be their mentor for this purpose.

Emily and Amit will study the use of novel auditory and acoustic phonetic measurements for discrimination of individual distinctive features, especially place and manner features at syllable onset. Jim and Partha expressed interest in mentoring them for this purpose. Carol Espy-Wilson and Ken Stevens were also mentioned as likely resources for this task.

Tom will learn to train and test SVMs, and to analyze their results using linear discriminant analysis, PCA, scatter plots and so on. Sarah, Amit, Partha, and Mark will be his mentors for this purpose.

PRELIMINARY EXPERIMENTS

We will focus on three preliminary experiments between now and May 14. Two are multi-part experiments. For each part, a small team of students and a faculty advisor were named. I (MH) have also named one graduate student to be the chief organizer for each experiment, responsible for maintaining a wiki page, and for keeping track of what other students working on the experiment are doing.

Teams and experiments are malleable to suit your interest. If you think that my description isn't the best way to solve your given problem, talk to your team-mates and find another way to solve it. If you have a good idea for solving somebody else's problem, contact them and suggest it and/or volunteer to try it. If, after seeing your problem described in more detail below, you think that you're not at all interested in that problem, let me know and we'll shuffle responsibilities.

PRONUNCIATION MODEL TESTING.

Chief Organizer: Karen. Faculty Resource: Mark. Participants: Karen, Amit, Sarah.
  1. Amit will compute two types of landmark transcriptions for the ICSI-transcribed portion of Switchboard. One type of transcription will be a MAP landmark alignment without prior knowledge of the words spoken; the other will be a MAP alignment given word knowledge. Amit and Sarah will each test all of the distinctive feature classifiers that they have developed so far, for the purpose of identifying features of each of the detected landmarks.
  2. Karen will figure out how to translate landmark-based Keyser & Stevens features (the features on which most SVMs are based) into frame-based Browman & Goldstein features (the features on which her DBN is based). Integration possibilities will be evaluated using a paradigm similar to her HLT/NAACL paper (automatic word transcription of WS97 corpus given manual or SVM-based phonetic transcriptions). Karen will also consider complexity issues, specifically: is it possible to maintain automatic transcription accuracy even if some variables are removed from the DBN, e.g., distinctive features with low functional load?

DISCRIMINATIVE EXTRACTION OF SVM TRAINING EXAMPLES BY PARSING RECOGNITION LATTICES OF TRAINING DATA.

Chief Organizer: Ken. Faculty Resource: Katrin. Participants: Ken, Vidya, Sarah, Tom. First proposed during discussion of point III.D.
  1. Mark will determine whether or not SRI can compute recognition lattices of some subset of the Switchboard training data.
  2. Ken and Vidya will augment lattices with the correct transcription of each sentence, then pinch each lattice to the correct transcription. The goal of this procedure is to determine what words the recognizer considered to be likely substitutes for each correct word.
  3. Ken and Vidya will use a dictionary-based method to determine landmark-based transcription of each edge in the pinched lattice. Incorrect edges will be compared to the correct edge, in order to identify distinctive features that differ. The output of this process will be something like the following, where "discriminative features" are those that discriminate between the correct word and any incorrect word:
    XXX.wav, edgeset minimum start time XXX, maximum start time XXX:
    Maximum number of syllables of any edge in this edgeset is XX.
    Number of syllables is/isn't discriminative.
    First syllable onset: discriminative features are YYY, YYY, YYY
    First syllable nucleus: discriminative features are ...
  4. Given knowledge of the type shown above, Sarah and Tom will figure out methods to find the best alignment of the specified landmarks (in order, within the specified segment), extract training tokens from the period of time around each landmark, and use the extracted training data to retrain the landmark detectors and classifiers.
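
The dictionary-based comparison in step 3 might look roughly like the following. This is a sketch under stated assumptions: the dictionary format (one dict per syllable, with "onset" and "nucleus" feature bundles), the feature names, and the three example words are all hypothetical.

```python
# Hypothetical landmark dictionary: word -> list of syllables.
DICTIONARY = {
    "bat": [{"onset": {"voice": "+", "place": "labial", "manner": "stop"},
             "nucleus": {"height": "low", "back": "-"}}],
    "pat": [{"onset": {"voice": "-", "place": "labial", "manner": "stop"},
             "nucleus": {"height": "low", "back": "-"}}],
    "mat": [{"onset": {"voice": "+", "place": "labial", "manner": "nasal"},
             "nucleus": {"height": "low", "back": "-"}}],
}

def discriminative_features(correct, competitors, dictionary=DICTIONARY):
    """List the features that separate the correct word from any competitor."""
    ref = dictionary[correct]
    words = [correct, *competitors]
    report = {
        "max_syllables": max(len(dictionary[w]) for w in words),
        "syllable_count_discriminative": any(
            len(dictionary[w]) != len(ref) for w in competitors),
    }
    for i, syllable in enumerate(ref):
        for slot in ("onset", "nucleus"):
            differing = set()
            for w in competitors:
                other_syllables = dictionary[w]
                # A competitor with fewer syllables has no features in this slot.
                other = other_syllables[i][slot] if i < len(other_syllables) else {}
                mine = syllable[slot]
                differing |= {f for f in mine.keys() | other.keys()
                              if mine.get(f) != other.get(f)}
            report[(i, slot)] = sorted(differing)
    return report

report = discriminative_features("bat", ["pat", "mat"])
```

For this toy edgeset the first syllable onset is discriminative in voice and manner, and the nucleus is not discriminative at all, so only onset classifiers would need to be run.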

EVALUATION OF NOVEL ACOUSTIC PHONETIC AND AUDITORY FEATURES

Chief Organizer: Amit. Faculty Resources: Jim, Partha, Carol Espy-Wilson, Ken Stevens. Participants: Amit and Emily. First proposed during discussion of point III.A.
  1. Read the acoustic phonetics and auditory psychology literature, and talk to students from Espy-Wilson's group and Stevens' group, in order to creatively list all of the acoustic correlates that have been proposed in the literature for each distinctive feature.
  2. Devise automatic algorithms (possibly several) to measure each of those acoustic correlates in the vicinity of transcribed landmark examples in NTIMIT (for stop bursts) or ICSI-Switchboard (for all other landmark types).
  3. Use methods of low computational complexity to narrow the space of possible acoustic representations somewhat. Probable first technique: measure the percentage overlap between the histograms p(observations|+feature) and p(observations|-feature), for each observation singly, and also for observations in small vectors, e.g., 2D and 3D vectors.
  4. After narrowing the space of observations somewhat, try a few SVM experiments. For example, train SVMs (1) using the whole vector of all possible measurements, including MFCCs, because SVMs are supposed to be good at generalizing from very large observation vectors, and (2) using just the best acoustic observations; then (3) compare the results to those we've already computed using just MFCCs.
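
The histogram overlap measure proposed in step 3 can be sketched for a single scalar observation as follows. The bin count, the shared bin range, and the toy values are assumptions made for illustration; an overlap near 0 suggests a well-separated observation, and an overlap near 1 suggests an uninformative one.

```python
def histogram(values, lo, hi, n_bins):
    """Normalized histogram of `values` over n_bins equal bins on [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        k = int((v - lo) / width)
        k = min(max(k, 0), n_bins - 1)  # put v == hi in the last bin
        counts[k] += 1
    return [c / len(values) for c in counts]

def overlap(pos, neg, n_bins=10):
    """Overlap between p(obs | +feature) and p(obs | -feature), from 0 to 1."""
    lo = min(min(pos), min(neg))
    hi = max(max(pos), max(neg))
    if hi == lo:
        return 1.0  # every value is identical, so the classes are inseparable
    hp = histogram(pos, lo, hi, n_bins)
    hn = histogram(neg, lo, hi, n_bins)
    return sum(min(p, n) for p, n in zip(hp, hn))

# Made-up measurements of one acoustic correlate for +feature and -feature tokens.
separated = overlap([0.0, 1.0, 2.0], [10.0, 11.0, 12.0])
identical = overlap([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Observations could then be ranked by ascending overlap, with only the best few passed on to the SVM experiments in step 4. Extending the same idea to 2D and 3D vectors would require multidimensional bins, which this sketch does not cover.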
