Workshop 2004
Workshop 2004 Sunday, November 23, 2008

Landmark-Based Speech Recognition: Problem Statement and On-line Discussion

(Mark Hasegawa-Johnson)

The goal of this workshop is to learn high-dimensional models of high-temporal-resolution spectral dynamics near instants of consonant release, consonant closure, and syllable nuclei. This information is complementary to the information extracted by a best-of-breed Switchboard recognizer, and the aim is to use it to improve the word error rate (WER) of the maximum-likelihood (ML) path through a Switchboard recognition lattice.

Methods are still under discussion (January 2004), but several people like the idea of a top-down approach (see the comments on bottom-up vs. top-down processing). Here is one possible approach, moving in from lattice to landmarks, then out again from landmarks to lattice:

  1. List alternative words. Given the lattice, create a list of the words that are plausible candidate explanations of each time segment.
  2. List alternative distinctive features. Look up each word in a lexicon model, find probabilities of all plausible pronunciations, and decide which distinctive features would provide the most information for discriminating among the word candidates (see the definition of distinctive features).
  3. List important landmarks. Determine which consonant release, consonant closure, syllable nucleus, or intervocalic glide would be the best place to find acoustics correlated with the important distinctive features.
  4. Find the landmarks, using SVMs trained as landmark detectors ([link]). It is not a case of SVMs vs HMMs.
  5. Score the distinctive features. Use SVMs to compute a nonlinear discriminant function for each important distinctive feature, based on observation of high-dimensional, high-resolution acoustics centered at the landmark.
  6. Score the words. Integrate distinctive feature likelihoods with lexicon priors to create a landmark-based acoustic score for each word.
  7. Score the lattice. Integrate landmark-based acoustic score, HMM-based acoustic score, and language model score, using stream weights trained on development test data.
  8. Test. Metric: difference between WER of the ML paths through the baseline and rescored lattice.
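
Steps 7 and 8 can be sketched as follows. This is only an illustration under stated assumptions, not the workshop's implementation: it assumes every lattice path already carries three log-domain scores, that the streams are combined log-linearly, and that the stream weights have already been trained. The function names, the weights, and the toy scores are all hypothetical.

```python
def combined_score(path, weights):
    """Log-linear combination of per-stream log scores (assumed form)."""
    return sum(weights[stream] * path[stream] for stream in weights)


def best_path(paths, weights):
    """Return the word sequence of the highest-scoring lattice path."""
    return max(paths, key=lambda p: combined_score(p, weights))["words"]


def word_errors(reference, hypothesis):
    """Levenshtein distance between two word lists (each edit costs 1)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(substitution, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    return word_errors(reference, hypothesis) / len(reference)


# Toy lattice with two competing paths; all scores are invented log values.
paths = [
    {"words": ["i", "see", "a", "tree"], "hmm": -10.0, "lm": -4.0, "landmark": -3.0},
    {"words": ["i", "see", "a", "three"], "hmm": -9.5, "lm": -4.2, "landmark": -6.0},
]
reference = ["i", "see", "a", "tree"]

baseline_weights = {"hmm": 1.0, "lm": 1.0}                   # no landmark stream
rescored_weights = {"hmm": 1.0, "lm": 1.0, "landmark": 0.5}  # hypothetical weights

baseline_wer = wer(reference, best_path(paths, baseline_weights))
rescored_wer = wer(reference, best_path(paths, rescored_weights))
wer_change = baseline_wer - rescored_wer  # the metric of step 8
```

With these invented numbers the baseline path ends in "three" and the rescored path ends in "tree", so wer_change is 0.25. In a real experiment the weights would be tuned on development test data, as step 7 says.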

Bottom Up vs Top Down Processing

(Jim Baker's comments, Nov 26)

Landmarks are ideally suited for bottom-up processing, as implied by the phrase "Landmark Detection". However, lattice rescoring is most naturally done top-down.

This note argues that the conflict should be made explicit and taken on head-on. Ignoring it will only lead to confusion.

Landmarks could be used in at least three different parts of a recognition system:

Usage preliminary to the main search would mainly be for the purpose of computation reduction. If the landmarks are not used in the main search itself, then they would not affect the rankings of the word sequence hypotheses in the main search, but they could reduce the amount of computation to do a search. This usage would naturally be bottom-up and this note recommends using landmark detection for this purpose, but it is not within the scope of this research project.

The simplest way to use landmarks in the main search is to define a bit vector for each frame indicating what landmarks, if any, have been detected at that point in time. Jeff Bilmes suggested this in the workshop planning meeting. However, this representation goes directly against the philosophy under which Ken Stevens proposed landmarks in the first place. Jim Baker argued at the planning meeting that there are other ways to use landmarks in the main search that do not require treating them as something that is observed (or not) at every frame. Baker's formulation is somewhat top-down, rather than being strictly bottom-up, but at least it only scores a landmark once per (hypothesized) occurrence. However, it is also outside the scope of this project, so it will not be pursued further here.
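
The per-frame bit vector representation might look like the sketch below. The landmark inventory, the 10 ms frame shift, and the detection times are assumptions chosen only for illustration.

```python
LANDMARK_TYPES = ["closure", "release", "nucleus", "glide"]  # assumed inventory
FRAME_SHIFT_MS = 10  # assumed frame shift


def landmark_bit_vectors(detections, num_frames):
    """Return one bit vector per frame; bit k is 1 if a landmark of
    type LANDMARK_TYPES[k] was detected in that frame."""
    frames = [[0] * len(LANDMARK_TYPES) for _ in range(num_frames)]
    for time_ms, kind in detections:
        frame = time_ms // FRAME_SHIFT_MS
        if 0 <= frame < num_frames:
            frames[frame][LANDMARK_TYPES.index(kind)] = 1
    return frames


# Hypothetical detector output: (time in ms, landmark type).
detections = [(20, "closure"), (85, "release"), (140, "nucleus")]
vectors = landmark_bit_vectors(detections, num_frames=20)
empty_frames = sum(1 for v in vectors if not any(v))
```

In this toy example 17 of the 20 frames carry an all-zero vector, which is the sense in which the representation treats a landmark as something observed (or not) at every frame.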

Rescoring is naturally done top-down. That is, the lattice provides particular hypotheses to be confirmed or denied. Rather than fight this point of view, which is natural for rescoring, this note suggests that we embrace top-down rescoring and take advantage of its characteristics. Just because landmarks can be used bottom-up doesn't mean they have to be.

What difference does it make whether the rescoring is viewed as top-down or bottom-up? The bottom-up point of view seems to require that we build a complete model capable of computing a complete score for a hypothesis from the lattice. This seems to be the present plan for the project. However, unless the landmarks are converted into frame-based vectors, a new form of modeling must be invented to include the landmarks. The research on such new modeling methodology should continue. However, such a new modeling methodology is not necessary if the rescoring is viewed as a top-down process.

Viewed top-down, it is only necessary for landmarks (or any other incremental knowledge introduced in the rescoring) to be able to modify the existing hypothesis scores. That is, it is not necessary to build a completely new model for each hypothesis, but merely to make some additional observations and to improve the score of the correct hypothesis relative to the scores of the incorrect hypotheses.

Top-down verification or rejection is a much simpler task. It is not necessary to have a complete inventory of landmarks. The system would work even in the extreme case that a particular hypothesized word has no landmarks at all. Then the system would leave the word's score unchanged. For each landmark that is available, there would be incremental improvement.
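
A minimal sketch of this incremental adjustment, assuming log-domain scores and a verifier that returns a log-likelihood ratio for each landmark it has a model for. The function name, the landmark labels, and the numbers are hypothetical.

```python
def rescore_word(baseline_score, expected_landmarks, verifier_scores):
    """Add a log-domain adjustment for each expected landmark that has a model.

    expected_landmarks: landmarks the word hypothesis predicts.
    verifier_scores: maps a landmark to a log-likelihood ratio (invented
        values here); landmarks missing from this dict have no model and
        contribute nothing.
    """
    adjustment = sum(
        verifier_scores[lm] for lm in expected_landmarks if lm in verifier_scores
    )
    return baseline_score + adjustment


verifier_scores = {"release@120ms": 1.2, "closure@60ms": -0.4}

# A word with no landmarks keeps its baseline score.
unchanged = rescore_word(-42.0, [], verifier_scores)

# "glide@200ms" has no model in the inventory, so it is skipped.
adjusted = rescore_word(
    -42.0, ["closure@60ms", "release@120ms", "glide@200ms"], verifier_scores
)
```

Because the adjustment is a sum over whatever landmarks happen to be modeled, the inventory can be incomplete without breaking the rescoring.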

In a pinched lattice, the task becomes discrimination among a small number of word choices. Consider one pair of words. The only landmarks that matter in adjusting the relative scores of these two words are those landmarks in our inventory that happen to occur in one of the words but not the other. For such a landmark it no longer matters whether the landmark was detected bottom-up. We are now asking the top-down question: "Could the landmark have occurred at a location consistent with this word hypothesis?" We ignore landmarks that occur in both words. We also regretfully do nothing for landmarks that occur in one of the words but that are not in our model inventory (it is not clear how missing models could be handled in a bottom-up complete score). For landmarks that happen to be detected bottom-up, but which do not occur in either word, we might be able to model a word-dependent probability of a false alarm of the detected landmark. However, primarily the score modification will be based on top-down evaluation of landmarks that occur in one word but not the other.
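
The pairwise selection can be sketched with set operations. This sketch ignores landmark timing and false-alarm modeling, and the landmark labels, inventory, and evidence values are hypothetical.

```python
def discriminating_landmarks(word_a_landmarks, word_b_landmarks, inventory):
    """Landmarks predicted by exactly one of the two words and covered by
    a model in the inventory. Shared landmarks and unmodeled landmarks
    are dropped."""
    a, b, inv = set(word_a_landmarks), set(word_b_landmarks), set(inventory)
    return (a - b) & inv, (b - a) & inv


def relative_score_shift(only_a, only_b, evidence):
    """Shift in score(A) - score(B): evidence for A-only landmarks counts
    for A, evidence for B-only landmarks counts for B."""
    return sum(evidence[lm] for lm in only_a) - sum(evidence[lm] for lm in only_b)


word_a = ["stop_release", "nucleus", "stop_closure"]
word_b = ["stop_release", "nucleus", "glide", "speech_offset"]
inventory = ["stop_release", "stop_closure", "nucleus", "glide"]

only_a, only_b = discriminating_landmarks(word_a, word_b, inventory)

# Invented top-down evidence (log-likelihood ratios) for each landmark.
evidence = {"stop_closure": 0.9, "glide": -1.1}
shift = relative_score_shift(only_a, only_b, evidence)
```

Here "stop_release" and "nucleus" are ignored because both words predict them, and "speech_offset" is ignored because it has no model, leaving one landmark per word to drive the relative score.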

A better name for this process would be "landmark-based word verification" rather than "landmark detection." Even though the scoring is done top-down, the use of landmarks is consistent with Ken Stevens's philosophy. We are not trying to compute a score based on frame-by-frame observations. Attention, and score modifications, are focused on the landmarks (or the places where a word hypothesis says landmarks should be).

This idea does not need to replace the research in building a complete new form of model that includes landmarks. However, it gives a simpler method for doing the essential component of the rescoring task.

Definition of Landmarks and Distinctive Features

(Mark Hasegawa-Johnson and Jim Baker)

The SVMs and the pronunciation model do NOT need to use the same set of distinctive features (an earlier version of this document claimed that they DO need to use the same set, but that conclusion is probably false). In fact, recognition results in possibly two separate distinctive feature transcriptions:

  1. The SVM generates a transcription of landmarks, and of distinctive features at those landmarks, based entirely on local spectral dynamics. The SVMs should model whatever landmarks and distinctive features most successfully embody the following characteristics: (1) lexically distinctive, (2) acoustically reliable.
  2. The pronunciation model observes these local distinctive feature transcripts, and attempts to match them to the distinctive feature matrix representing some known word (Comment: Must we match a feature matrix?). The pronunciation model needs to know, in advance, what distinctive features the SVM will try to detect; but the graphical model of each word need not use all or only the distinctive features detected by the SVM. In particular, the pronunciation model should use distinctive features that best represent changes in conversational speech.

"Acoustically reliable:" the acoustic correlates of a distinctive feature are observed every time the distinctive feature is correctly and fully implemented by a talker. Examples:

  1. The acoustic signature of a reduced obstruent (e.g., "every" -> "ewry") is not the same as that of a fully implemented obstruent, but it is also not the same as that of a glide. A reduced obstruent may appear as a [+sonorant,-continuant] segment, meaning that it has a distinct closure and a distinct release, but no burst spectrum or aspiration at the moment of release.
  2. Stop place of articulation may be most reliably classified by carefully modeling the spectral dynamics within 50 ms after release.
  3. Retroflex and alveolar are not the same place of articulation: the initial consonants in "tree" and "two" have different places of articulation.
  4. Doubly articulated stops should perhaps have a different place of articulation feature? For example, /p/ in "play" is labio-lateral?

"Lexically distinctive:" features of the onset consonant in a lexically stressed syllable are the least subject to reduction and assimilation in conversational speech (http://www.icsi.berkeley.edu/~steveng/PDF/Phonetic_Patterning.pdf). Of these features, place features are most lexically distinctive, meaning that knowledge of place of articulation results in a smaller list of word hypotheses than does knowledge of manner or voicing. HMM-based recognizers already model vowel features pretty well; our best hope for finding new information is to model consonant place of articulation as accurately as possible.

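One way to make "lexically distinctive" measurable is to compute the expected number of word candidates that remain after a feature of the onset consonant has been observed. The sketch below does this for a tiny invented lexicon with equally likely words. The lexicon, the feature labels, and the function name are assumptions for illustration only; the resulting numbers say nothing about a real vocabulary.

```python
from collections import Counter

# Invented toy lexicon: onset consonant features of eight words.
TOY_LEXICON = {
    "pie":  {"place": "labial",   "manner": "stop",      "voicing": "voiceless"},
    "buy":  {"place": "labial",   "manner": "stop",      "voicing": "voiced"},
    "tie":  {"place": "alveolar", "manner": "stop",      "voicing": "voiceless"},
    "die":  {"place": "alveolar", "manner": "stop",      "voicing": "voiced"},
    "guy":  {"place": "velar",    "manner": "stop",      "voicing": "voiced"},
    "sigh": {"place": "alveolar", "manner": "fricative", "voicing": "voiceless"},
    "my":   {"place": "labial",   "manner": "nasal",     "voicing": "voiced"},
    "nigh": {"place": "alveolar", "manner": "nasal",     "voicing": "voiced"},
}


def expected_cohort_size(lexicon, feature):
    """Expected number of remaining candidates after observing `feature`,
    assuming every word is equally likely: sum over values of n^2 / N."""
    counts = Counter(entry[feature] for entry in lexicon.values())
    total = len(lexicon)
    return sum(n * n for n in counts.values()) / total


cohort_sizes = {
    feature: expected_cohort_size(TOY_LEXICON, feature)
    for feature in ("place", "manner", "voicing")
}
```

A smaller expected cohort means the feature is more lexically distinctive. The same computation over a real pronunciation lexicon, weighted by word priors, would be one way to check the claim about place features.
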
"Conversational speech:" the lexical model is intended to model the interdependence, among features, of processes of reduction, assimilation, and asynchrony. For example:

  1. Nuclei of lexically unstressed syllables tend to be partially or fully reduced, thus the vowel features should depend on lexical stress (i.e., stressed vowel and unstressed vowel should not be considered identical phones).
  2. Coda consonants in unstressed syllables tend to be deleted (http://www.icsi.berkeley.edu/~steveng/PDF/Phonetic_Patterning.pdf).
  3. Manner of articulation must change at a manner-change landmark, but it need not change as specified in the lexicon: obstruents in the coda of an unstressed syllable may be implemented as sonorant "reduced obstruents."
  4. Observations will best specify consonant place of articulation at the moment of release or closure (Furui, 1986). Should the "place of articulation" hidden state variable take on a distinct value only at this moment in time, and then remain unspecified during the closure interval?

Here is an initial suggested list of landmarks and features (this list is a slightly modified subset of the features given in Stevens, Acoustic Phonetics, tables 5.3 and 5.4):

Landmarks

Should speech onset and speech offset be detected? The only reason to detect these is to distinguish them from stop closures and stop releases; they are not lexically distinctive.

Distinctive Features

The ARPABET phones /em/, /en/, /eng/, and /el/ are [+consonantal] syllable nuclei, meaning that they should match a syllable nucleus landmark, but that they also match a consonant release landmark if they are followed by a vowel ("bottle of..."). /er/ and /r/ are [-consonantal].


The Center for Language and Speech Processing
The Johns Hopkins University
3400 North Charles Street, Barton Hall
Baltimore, MD 21218
*Telephone: (410) 516-4237 *Fax: (410) 516-5050 *E-mail: clsp@clsp.jhu.edu