| Workshop 2004 | Sunday, November 23, 2008 |
The goal of this workshop is to learn high-dimensional models of high-temporal-resolution spectral dynamics near instants of consonant release, consonant closure, and syllable nuclei (information complementary to the information extracted by a best-of-breed Switchboard recognizer), and to use this novel information to improve the WER of the ML path through a Switchboard recognition lattice.
Methods are still under discussion (January 2004), but several people like the idea of a top-down approach (see the comment on bottom-up vs. top-down processing). Here is one possible approach, moving in from lattice to landmarks, then out again from landmarks to lattice:
Landmarks are ideally suited for bottom-up processing, as implied by the phrase "Landmark Detection". However, lattice rescoring is most naturally done top-down.
This note argues that the conflict should be made explicit and taken on head-on. Ignoring it will only lead to confusion.
Landmarks could be used in at least three different parts of a recognition system:
The simplest way to use landmarks in the main search is to define a bit vector for each frame indicating what landmarks, if any, have been detected at that point in time. Jeff Bilmes suggested this in the workshop planning meeting. However, this representation goes directly against the philosophy under which Ken Stevens proposed landmarks in the first place. Jim Baker argued at the planning meeting that there are other ways to use landmarks in the main search that do not require treating them as something that is observed (or not) at every frame. Baker's formulation is somewhat top down, rather than being strictly bottom-up, but at least it only scores a landmark once per (hypothesized) occurrence. However, it also is outside the scope of this project, so it will not be pursued further here.
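The frame-level bit-vector representation described above can be sketched as follows. This is only an illustration: the three landmark types are taken from the workshop goal statement, while the `Landmark` class, the `frame_bit_vectors` function, and the detection tuples are hypothetical names and made-up inputs, not part of any existing system.

```python
from enum import IntFlag


class Landmark(IntFlag):
    """Hypothetical landmark inventory: one bit per landmark type."""
    NONE = 0
    CONSONANT_RELEASE = 1
    CONSONANT_CLOSURE = 2
    SYLLABLE_NUCLEUS = 4


def frame_bit_vectors(num_frames, detections):
    """Build one bit vector per frame from (frame_index, landmark) pairs.

    Frames with no detected landmark keep the value Landmark.NONE.
    """
    frames = [Landmark.NONE] * num_frames
    for frame_index, landmark in detections:
        frames[frame_index] = frames[frame_index] | landmark
    return frames


# Made-up detections for an 8-frame stretch of signal (assumption).
detections = [
    (2, Landmark.CONSONANT_CLOSURE),
    (5, Landmark.CONSONANT_RELEASE),
    (5, Landmark.SYLLABLE_NUCLEUS),
]
frames = frame_bit_vectors(8, detections)
```

Note that most frames carry the empty vector, which illustrates the objection raised above: the landmark is treated as something observed (or not) at every frame rather than once per occurrence.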
Rescoring is naturally done top-down: the lattice provides particular hypotheses to be confirmed or denied. Rather than fight this point of view, this note suggests that top-down rescoring should be embraced and its characteristics exploited. Just because landmarks can be used bottom-up doesn't mean they have to be.
What difference does it make whether the rescoring is viewed as top-down or bottom-up? The bottom-up point of view seems to require that we build a complete model capable of computing a complete score for a hypothesis from the lattice. This seems to be the present plan for the project. However, unless the landmarks are converted into frame-based vectors, a new form of modeling must be invented to include the landmarks. The research on such new modeling methodology should continue. However, such a new modeling methodology is not necessary if the rescoring is viewed as a top-down process.
Viewed top-down, it is only necessary for landmarks (or any other incremental knowledge introduced in the rescoring) to be able to modify the existing hypothesis scores. That is, it is not necessary to build a completely new model for each hypothesis, but merely to make some additional observations and to improve the score of the correct hypothesis relative to the scores of the incorrect hypotheses.
Top-down verification or rejection is a much simpler task. It is not necessary to have a complete inventory of landmarks. The system would work even in the extreme case that a particular hypothesized word has no landmarks at all. Then the system would leave the word's score unchanged. For each landmark that is available, there would be incremental improvement.
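A minimal sketch of this incremental behavior, assuming scores are kept in the log domain and that some top-down verifier supplies a log-likelihood ratio for each landmark it has a model for. The function name `rescore_word`, the landmark labels, and all numbers are hypothetical and serve only to illustrate the idea.

```python
def rescore_word(base_score, expected_landmarks, verifier_llr):
    """Adjust a lattice word score using only the landmarks we can verify.

    base_score: log-domain score of the word hypothesis from the lattice.
    expected_landmarks: landmark labels the word hypothesis predicts.
    verifier_llr: maps a landmark label to the log-likelihood ratio
        produced by a (hypothetical) top-down verifier; a landmark with
        no entry has no model and contributes nothing.
    """
    score = base_score
    for landmark in expected_landmarks:
        if landmark in verifier_llr:
            score += verifier_llr[landmark]
    return score


# A word with no landmarks keeps its lattice score.
unchanged = rescore_word(-12.0, [], {"release": 0.7})

# "nucleus" has no model here, so only two increments are applied.
adjusted = rescore_word(
    -12.0,
    ["closure", "release", "nucleus"],
    {"closure": -0.5, "release": 1.5},
)
```

Under these assumptions the inventory can be arbitrarily incomplete: every landmark with a model adds one increment, and everything else leaves the lattice score as it was.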
In a pinched lattice, the task becomes discrimination among a small number of word choices. Consider one pair of words. The only landmarks that matter in adjusting the relative scores of these two words are those landmarks in our inventory that happen to occur in one of the words but not the other. For such a landmark it no longer matters whether the landmark was detected bottom-up. We are now asking the top-down question: "Could the landmark have occurred at a location consistent with this word hypothesis?" We ignore landmarks that occur in both words. We also regretfully do nothing for landmarks that occur in one of the words but that are not in our model inventory (it is not clear how missing models could be handled in a bottom-up complete score). For landmarks that happen to be detected bottom-up, but which do not occur in either word, we might be able to model a word-dependent probability of a false alarm of the detected landmark. However, primarily the score modification will be based on top-down evaluation of landmarks that occur in one word but not the other.
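The pairwise selection described above can be sketched as follows, assuming each word hypothesis is summarized by a set of landmark labels and that a top-down evaluation yields a log-likelihood ratio per landmark. All names (`discriminating_landmarks`, `relative_score_change`, the landmark labels) and all numbers are hypothetical; the optional false-alarm term for landmarks that occur in neither word is left out.

```python
def discriminating_landmarks(word_a, word_b, inventory):
    """Return the modeled landmarks that occur in exactly one of two words.

    Landmarks shared by both words are ignored, and landmarks outside the
    model inventory are dropped because nothing can be done with them.
    """
    a, b = set(word_a), set(word_b)
    only_a = (a - b) & set(inventory)
    only_b = (b - a) & set(inventory)
    return only_a, only_b


def relative_score_change(word_a, word_b, inventory, llr):
    """Change in the log score of word A relative to word B.

    llr maps each landmark to the log-likelihood ratio of a top-down
    check that the landmark could have occurred where the word hypothesis
    says it should (an assumed interface, not an existing one).
    """
    only_a, only_b = discriminating_landmarks(word_a, word_b, inventory)
    return sum(llr[x] for x in only_a) - sum(llr[x] for x in only_b)


# Hypothetical landmark sets for two competing word hypotheses.
word_a = {"stop_closure", "stop_release", "nucleus"}
word_b = {"fricative_onset", "nucleus", "glide_dip"}
inventory = {"stop_closure", "stop_release", "nucleus", "fricative_onset"}
llr = {"stop_closure": 1.0, "stop_release": 0.5, "fricative_onset": -2.0}

only_a, only_b = discriminating_landmarks(word_a, word_b, inventory)
delta = relative_score_change(word_a, word_b, inventory, llr)
```

In this toy case "nucleus" is shared and therefore ignored, "glide_dip" is outside the inventory and therefore contributes nothing, and the remaining three landmarks determine how far word A moves relative to word B.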
A better name for this process would be "landmark-based word verification" rather than "landmark detection." Even though the scoring is done top-down, the use of landmarks is consistent with Ken Stevens's philosophy. We are not trying to compute a score based on frame-by-frame observations. Attention, and score modifications, are focused on the landmarks (or the places where a word hypothesis says landmarks should be).
This idea does not need to replace the research in building a complete new form of model that includes landmarks. However, it gives a simpler method for doing the essential component of the rescoring task.
The SVMs and the pronunciation model do NOT need to use the same set of distinctive features (an earlier version of this document claimed that they DO need to use the same set, but that conclusion is probably false). In fact, recognition results in possibly two separate distinctive feature transcriptions:
The acoustic signature of a reduced obstruent (e.g., "every" -> "ewry") is not the same as that of a fully implemented obstruent, but it is also not the same as that of a glide. A reduced obstruent may appear as a [+sonorant,-continuant] segment, meaning that it has a distinct closure and a distinct release, but no burst spectrum or aspiration at the moment of release. Stop place of articulation may be most reliably classified by carefully modeling the spectral dynamics within 50 ms after release. Retroflex and alveolar are not the same place of articulation: the initial consonants in "tree" and "two" have different places of articulation. Should doubly articulated stops perhaps have a different place-of-articulation feature? For example, is the /p/ in "play" labio-lateral?
"Lexically distinctive:" features of the onset consonant in a lexically stressed syllable are the least subject to reduction and assimilation in conversational speech (http://www.icsi.berkeley.edu/~steveng/PDF/Phonetic_Patterning.pdf). Of these features, place features are most lexically distinctive, meaning that knowledge of place of articulation results in a smaller list of word hypotheses than does knowledge of manner or voicing. HMM-based recognizers already model vowel features pretty well; our best hope for finding new information is to model consonant place of articulation as accurately as possible.
"Conversational speech:" the lexical model is intended to model the interdependence, among features, of processes of reduction, assimilation, and asynchrony. For example:
- Nuclei of lexically unstressed syllables tend to be partially or fully reduced, thus the vowel features should depend on lexical stress (i.e., stressed vowel and unstressed vowel should not be considered identical phones).
- Coda consonants in unstressed syllables tend to be deleted (http://www.icsi.berkeley.edu/~steveng/PDF/Phonetic_Patterning.pdf).
- Manner of articulation must change at a manner-change landmark, but it need not change as specified in the lexicon: obstruents in the coda of an unstressed syllable may be implemented as sonorant "reduced obstruents."
- Observations will best specify consonant place of articulation at the moment of release or closure (Furui, 1986). Should the "place of articulation" hidden state variable take on a distinct value only at this moment in time, and then remain unspecified during the closure interval?
Here is an initial suggested list of landmarks and features (this list is a slightly modified subset of the features given in Stevens, Acoustic Phonetics, tables 5.3 and 5.4):
| The Center for Language and Speech Processing The Johns Hopkins University 3400 North Charles Street, Barton Hall Baltimore, MD 21218 | |||||
| Telephone: (410) 516-4237 | Fax: (410) 516-5050 | E-mail: clsp@clsp.jhu.edu | |||