WS04 Workshop Group, Landmark-Based Speech Recognition

Second Planning Meeting, May 14, 2004

Contents

  1. Goals
  2. System Architecture
  3. Morning breakout sessions
  4. Summary, discussion, and lunch
  5. How Well Will Rescoring Work?
  6. Development Schedule
  7. Lattices or N-Best Lists?
  8. Afternoon breakout sessions
  9. Summary, discussion, and conclusions
  10. Detailed analysis of relationship between distinctive feature systems of the DBN and SVMs

Goals for the summer: large and small

Our goals for this summer include short-term goals (reduce WER), and very long-term goals (provide mathematical and empirical evidence supporting the claim that advanced machine learning methods, together with psychologically plausible models of nonlinear phonology, may yield improved models of pronunciation variation and of the acoustic correlates of lexical distinctions). The short-term goal is important because it brings credibility to our pursuit of the long-term goals.

System Architectures/Methods

At the beginning of the day, Mark outlined a prototype architecture for lattice rescoring based on DBN integration of acoustic phonetic landmark scores. Mark's architecture received at least one critical revision during the day: all three breakout sessions independently proposed that the system should be modular, in that forced alignment of pronunciations from a canonical lexicon (a standard list-type lexicon with 1 or 1.5 pronunciations per word), using Amit's stochastic segment model, should be a baseline for experiments with the more flexible DBN-based lexicon. The final consensus architecture looks like this:

  • Word labels and word alignment times are read from lattices or N-best lists.
  • Stochastic segment model (SSM) with SVM-based landmark scores is run in two modes. In both modes, SVMs will score only the distinctive features for which SVMs can provide lexically distinctive information complementary to that available to the HMM. Which distinctive features are included in this definition? Preliminary lattice analysis suggests that detection of consonant deletions and insertions, and correct labeling of consonant manner and place, are most important, but decisions such as this should be constantly revisited as more empirical data become available. Modes of SSM processing are:
    1. Unconstrained mode: identify the landmarks & distinctive features that are most probable given acoustics, with no prior information.
    2. Constrained mode: SSM estimates the probability of each word specified in the lattice, assuming that the word is pronounced according to its canonical pronunciation given in the dictionary. Call this p(word|SSM).
  • DBN observes the landmark scores computed in unconstrained mode. From these, the DBN computes several different estimates of the posterior probability of each word label in the lattice, or in an N-best sentence, given the SVM scores. Different estimates will make different assumptions about the word start and end times (relative to the start and end times given in the lattice), and about the word context phonemes. Call these p(word|DBN,context).
  • Lattice or N-best sentences are rescored according to
        log p(word|language model) + a*log p(word|HMM) + b*log p(word|SSM) + c*log p(word|DBN,context1) + d*log p(word|DBN,context2) + ...
    The constants a, b, c, d, ... are estimated on development test data. The set of DBN and SSM context definitions to include is also determined on development test data.
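
The log-linear combination above can be sketched as follows. This is only a minimal illustration under assumed names: the dictionary layout, the source labels ("lm", "hmm", "ssm", "dbn_ctx1"), and all scores and weights are invented for the example and do not come from any workshop system.

```python
def rescore(hypotheses, weights):
    """Return hypotheses sorted by combined log-linear score, best first.

    Each hypothesis is a dict with a "words" list and a "logp" dict mapping
    a knowledge-source name to the summed log probability of its words.
    The language-model score ("lm") has an implicit weight of 1.
    """
    def total(hyp):
        score = hyp["logp"]["lm"]
        for source, weight in weights.items():
            score += weight * hyp["logp"][source]
        return score

    return sorted(hypotheses, key=total, reverse=True)


# Hypothetical two-entry N-best list with made-up log probabilities.
nbest = [
    {"words": ["he", "travel", "far"],
     "logp": {"lm": -6.0, "hmm": -40.0, "ssm": -12.0, "dbn_ctx1": -11.0}},
    {"words": ["he", "traveled", "far"],
     "logp": {"lm": -6.5, "hmm": -40.5, "ssm": -9.0, "dbn_ctx1": -8.0}},
]

# Hypothetical weights a, b, c; in practice they would be tuned on dev data.
weights = {"hmm": 1.0, "ssm": 0.5, "dbn_ctx1": 0.5}
best = rescore(nbest, weights)[0]
print(" ".join(best["words"]))
```

With the SSM and DBN weights set to zero the first-pass ordering is unchanged, so the HMM-only system is recoverable as a special case of the same formula.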

    How well will rescoring work?

    Katrin reported some preliminary experiments with the BBN lattices. In her preliminary experiments, she computed a 23.5% WER for these lattices on RT01 development test data. This includes 4.0% insertions, 5.1% deletions, and 14.5% substitutions.

    The correct word was found in the lattice in only 8% of substitution errors. Thus the maximum WER improvement attainable by correcting substitution errors is only about 1.1% absolute (8% of the 14.5% substitution rate).
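
The bound is simple arithmetic on the two reported figures. A short check, using only the numbers quoted above (variable names are ours):

```python
substitution_rate = 14.5      # percent of reference words, as reported
recoverable_fraction = 0.08   # share of substitutions whose correct word is in the lattice

# Upper bound on absolute WER reduction from fixing substitutions alone.
max_gain = substitution_rate * recoverable_fraction
print(f"upper bound on WER reduction: {max_gain:.2f} points")
```

The product is about 1.16 points, in line with the figure of roughly 1.1% given in the text.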

    In decreasing order of frequency, substitution errors were caused by:

    1. Word-final consonant clusters: travel vs. traveled, burns vs. burned.
    2. Function word sequences vs. longer words (in a sense vs. innocence).
    3. Multiple confusions (discretion vs. discussion).
    4. Word-initial consonant confusions.
    Vowel confusions were very rare.

    Results above suggest that research should focus on detection of manner-change landmarks, especially in syllable-final position, for detecting word substitutions, insertions, and deletions.

    Development Schedule

    We will close the loop early and often. Preliminary SSM and DBN scores (p(word|algorithm)) will be available for lattices covering the ICSI-Switchboard data by the beginning of the workshop. If RT02 and RT03 data are available in time, preliminary scores will also be available for those databases. After the beginning of the workshop, we will try to refine scores on a weekly basis (by Monday morning of every week?).

    Purpose of closing the loop early and often:

    Rescoring experiments will determine which distinctive features and which acoustic features seem to benefit WER. Inputs and outputs of the SVMs, SSM, and DBN will be adapted to rescoring results in order to maximize our chance of success.

    Lattices or N-best lists?

    Until 5/14/04, we assumed that we would rescore lattices this summer. A number of points raised on 5/14/04 suggested that we may be better off rescoring N-best lists. The question has not yet been resolved. Points in support of N-best lists include:
    1. SRI quite routinely uses "word-level side information" (the phrase comes from the title of Vergyri's 2000 ICASSP paper) to rescore a 2000-best list. Methods for doing so are well-proven. Some of these methods are known to perform less well for lattice rescoring; e.g., there is some sense that amoeba search may be more likely to find a suboptimal local minimum of WER if used for lattice rescoring instead of N-best rescoring.

      It was proposed that Kemal, Ken and Vidya can spend some time running "sanity checks" to make sure that the amoeba search is finding reasonable WER minima. They could also test other optimization techniques, but doubt was expressed that 6 weeks will be enough to develop a new optimization technique for log-linear lattice rescoring.

    2. The possibility of top-down rescoring.

      With an N-best list, we can specify exactly which distinctive features distinguish the first-pass transcription from each alternative sentence. It would therefore be possible to test an alternative rescoring paradigm, consisting of N-1 binary comparisons between the first-pass transcription and, in turn, each of its N-1 alternatives. In the alternative paradigm, binary discrimination between the first-pass transcription and each alternative would be computed by scoring only a handful of the most salient or reliable landmarks.

      At the 4/16 meeting, we discussed a top-down rescoring algorithm using pinched lattices: each word in the first-pass transcription would be compared against alternative words listed in the pinched lattice. This method suffers because pinched lattices lose precise time alignment, and precise time alignment is critical for most of the confusions observed in practice (e.g., "movie" vs. "movies"). I believe that no solution to this problem has been proposed.

      An unpinched lattice does not define alternative transcriptions. In order to use top-down rescoring, it is necessary either to pinch the lattice, or to enumerate an N-best list from the lattice.
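
As a rough illustration of the first step of top-down rescoring, the sketch below finds the word-level regions where each N-best alternative differs from the first-pass transcription. It uses Python's standard difflib for the word alignment; that choice, the function name, and the example sentences are ours. Mapping each differing region to the distinctive features and landmarks that separate it is the harder part and is not attempted here.

```python
from difflib import SequenceMatcher


def differing_regions(first_pass, alternative):
    """List (first_pass_words, alternative_words) pairs where the two differ."""
    matcher = SequenceMatcher(a=first_pass, b=alternative, autojunk=False)
    regions = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            regions.append((first_pass[i1:i2], alternative[j1:j2]))
    return regions


# Hypothetical first-pass transcription and two N-best alternatives.
first_pass = "i saw the movie yesterday".split()
alternatives = [
    "i saw the movies yesterday".split(),
    "i saw a movie yesterday".split(),
]

# One binary comparison per alternative: N-1 comparisons in total.
comparisons = [differing_regions(first_pass, alt) for alt in alternatives]
for alt, regions in zip(alternatives, comparisons):
    print(" ".join(alt), "->", regions)
```

Each binary comparison then needs to score only the landmarks inside the differing regions, rather than the whole utterance.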

      Multiple Scores

      All breakout groups proposed generating multiple system variants. Most likely there will be a few dozen different system variants; each of these system variants could produce a different p(word|system) score. Lattice or N-best rescoring experiments, with development test data, will determine how many of these scores to combine, and will determine log-linear weights for the retained scores. Previous experience suggests that amoeba search can compute good log-linear combination weights for perhaps 2-10 auxiliary word scores.
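
To make the weight-estimation step concrete, here is a small sketch that picks a log-linear weight by minimizing word errors on a toy development set. Note the substitution: it uses an exhaustive grid search, not the amoeba (Nelder-Mead) search mentioned above, purely because a grid is easy to verify by hand. The data layout, scores, and grid values are all hypothetical.

```python
from itertools import product


def word_errors(hyp, ref):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (h != r)))
        prev = cur
    return prev[-1]


def pick(nbest, weights):
    """Return the words of the hypothesis with the best combined score."""
    def score(hyp):
        _words, base, aux = hyp
        return base + sum(w * a for w, a in zip(weights, aux))
    return max(nbest, key=score)[0]


def tune(dev_set, grid, n_aux):
    """Try every weight tuple on the grid; return (errors, weights) of the first best."""
    best_errors, best_weights = None, None
    for weights in product(grid, repeat=n_aux):
        errors = sum(word_errors(pick(nbest, weights), ref) for ref, nbest in dev_set)
        if best_errors is None or errors < best_errors:
            best_errors, best_weights = errors, weights
    return best_errors, best_weights


# Toy dev set: (reference, [(hypothesis, first-pass log score, auxiliary log scores)]).
dev_set = [
    ("he traveled far".split(), [
        ("he travel far".split(), -46.0, (-12.0,)),
        ("he traveled far".split(), -47.0, (-9.0,)),
    ]),
    ("in a sense".split(), [
        ("innocence".split(), -30.0, (-10.0,)),
        ("in a sense".split(), -31.0, (-7.0,)),
    ]),
]
errors, weights = tune(dev_set, grid=[0.0, 0.25, 0.5, 1.0], n_aux=1)
print(errors, weights)
```

The cost of a grid grows exponentially with the number of auxiliary scores, which is one reason a simplex-style search is the more practical choice once several scores are combined.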

      Word Context

      Preliminary lattice analysis indicates that landmarks near the end of the word are critical for lattice or N-best rescoring. Therefore rescoring depends on good information about (1) the time at which a word ends, and (2) the phonemes that follow a word. The lexicons breakout group proposed that SSMs and DBNs can generate multiple scores, using multiple context definitions. The following context definition was specifically proposed.

      1. The segment of time spanned by a word will be expanded by about 50ms on either side (start_time -= 50ms, end_time += 50ms). Some of these extra frames will be dropped in step 3.
      2. The starting position of all articulators (before the first landmark of the word) will be set to the last phoneme of the preceding word. Likewise, the ending position of all articulators (after the last landmark of the word) will be set to the first phoneme of the following word. "Preceding word" and "following word" are taken from the first-pass transcription, since the first-pass transcription is right about 75% of the time. These extra "phonemes" will be dropped in step 3.
      3. Forced alignment will be used to find the best alignment time for the first and last landmarks of the word. The probability p(word|algorithm) will be computed using only the frames that fall between the first landmark and last landmark of the word, inclusive. In the constrained-SSM alignment, this operation is done by the SSM. In DBN alignment, this operation is done by the DBN.
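
Steps 1 and 3 of this context definition can be sketched as below. The sketch works in integer milliseconds and assumes a 10 ms frame step; both are our assumptions, as are the function names. The landmark times stand in for the output of forced alignment, which is not implemented here, and step 2 (context phonemes) is not modeled.

```python
def expand_window(start_ms, end_ms, pad_ms=50):
    """Step 1: widen the word's time span by pad_ms on each side."""
    return max(0, start_ms - pad_ms), end_ms + pad_ms


def trim_to_landmarks(frame_times, landmark_times):
    """Step 3: keep only frames between the first and last landmark, inclusive."""
    first, last = min(landmark_times), max(landmark_times)
    return [t for t in frame_times if first <= t <= last]


# Word boundaries as read from the lattice (hypothetical values).
start, end = expand_window(1200, 1500)

# Frames at an assumed 10 ms step across the expanded window.
frame_times = list(range(start, end + 1, 10))

# Hypothetical forced-alignment output: first, middle, and last landmark.
landmarks = [1180, 1310, 1520]
kept = trim_to_landmarks(frame_times, landmarks)
print(start, end, kept[0], kept[-1], len(kept))
```

In this example the aligned word ends 20 ms later than the lattice boundary, which is the kind of adjustment the 50 ms padding is meant to allow.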

      Algorithm details

      Points above are issues that concern all subsystems. The following two points are somewhat more picayune algorithmic details, but are probably important for system performance.

      Definition of acoustic scores

      An MLP computes p(class|acoustics). An SVM computes p(class|discriminant), where the discriminant is a scalar summary of the acoustics. The discriminant dimension is computed in a way that minimizes generalization error, thus p(class|discriminant) can be a good approximation of p(class|acoustics), even when the dimension of the acoustic observation vector is 10,000 or more, as in many of our experiments.

      The DBN needs w*p(acoustics|class), for some arbitrary constant w.

      Morgan and Bourlard (Signal Processing Magazine, 1995, 12(5):25-42) showed that the version of p(class|acoustics) computed by an MLP is NOT the same one that's needed by the recognizer. Instead, the MLP computes p(class|acoustics) under the assumption that class priors during training equal class priors during testing. Usually it is not wise to use the true priors to train an MLP or SVM, because if the true prior probability of class=i is very small, the minimum error classifier might simply choose NEVER to say class=i.

      Instead, train using one set of priors pi(class=i), and test using a different set of priors rho(class=i). For example, pi(class=i) is often chosen so that all classes have an equal number of training tokens. Then the output of the MLP or SVM is an estimate of the following probability:

      (**) pi(class=i|acoustics) = p(acoustics|class=i) * pi(class=i) / sum_j (p(acoustics|class=j)*pi(class=j)).

      During testing, the SVM or MLP computes pi(class=i|acoustics). We then solve the N linear equations (**) in order to find the N unknowns p(acoustics|class=i). Note that the equations are linear in the unknowns, because we can multiply both sides of (**) by the denominator of the RHS. The matrix is singular, however, because the posteriors sum to one: sum_j pi(class=j|acoustics) = 1. The null space of the matrix is as follows: if p(acoustics|class) is a solution of (**), then so is w*p(acoustics|class), for any real w.

      In particular, if pi(class=i) is uniform (equal numbers of all classes used during training), then the solution to equations (**) is pi(class=i|acoustics) = w*p(acoustics|class=i) for some w. In other words, if we use equal numbers of + and - tokens to train the MLP (or alternatively, train the MLP using a weighted squared error metric, where the weight is one over the number of training tokens in each class), then the estimates pi(class=i|acoustics) can be used in place of p(acoustics|class) by the DBN.
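
A small numerical check of this argument, with made-up numbers: dividing the classifier's posteriors by the training priors recovers the likelihoods up to a common scale factor w, and with uniform training priors the posteriors themselves are already proportional to the likelihoods. Function and variable names are ours.

```python
def posteriors(likelihoods, priors):
    """Equation (**): posterior of each class from class likelihoods and priors."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]


def scaled_likelihoods(posts, train_priors):
    """Divide posteriors by training priors, giving w * p(acoustics|class)."""
    return [q / pri for q, pri in zip(posts, train_priors)]


# Hypothetical p(acoustics|class) for two classes; their ratio is 3.
true_likelihoods = [0.02, 0.06]

# Non-uniform training priors: posteriors must be divided by the priors.
train_priors = [0.8, 0.2]
scaled = scaled_likelihoods(posteriors(true_likelihoods, train_priors), train_priors)
ratio = scaled[1] / scaled[0]

# Uniform training priors: posteriors are proportional to likelihoods as they are.
uniform = posteriors(true_likelihoods, [0.5, 0.5])
uniform_ratio = uniform[1] / uniform[0]
print(ratio, uniform_ratio)
```

Both ratios come out at 3 (up to rounding), the ratio of the underlying likelihoods, which is all the DBN requires since it tolerates an arbitrary constant w.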

      Mapping from DBN distinctive features to SVM distinctive features

      The last hour of the day was spent on an in-depth analysis of the mapping between DBN distinctive features and SVM/landmark distinctive features. A deterministic many-to-one mapping between DBN and SVM features seems to be pretty easy, except for the following. In order to distinguish between silence, /y/, and /i/ (or between silence, /n/, and /en/), the most elegant solution will give the DBN a new hidden variable called something like "prosodic tier" or "sonorancy rank" or "slot alignment." The new hidden variable might take four settings: {silence, consonant, unstressed nucleus, and stressed nucleus}.
