MODELING POTENTIALLY HIDDEN LINGUISTIC EVENTS EXTERNALLY TO THE NGRAM

Motivation and Goals

This work grew out of a desire to model disfluencies in conversational speech, but was generalized to accommodate the modeling of other phenomena, most notably other aspects of conversational speech such as hedges and flavor words. The current goals for this work are:

  1. Better language modeling (as measured by perplexity (PP) and word error rate (WER)) by paying special attention to specific phenomena.
  2. Annotation of hidden events. In the case of disfluencies, this enables automatic disfluency-cleaning.
  3. Modeling of idiosyncratic events that had previously been removed from the data stream. This is the last stage in the "data bleaching" paradigm, during which one inverts the bleaching that had previously been applied to the target domain.


Basic Ideas

These goals will be pursued by developing a general mechanism for case-based modeling of linguistic events within an otherwise Ngram framework. The events thus modeled may be hidden; that is, it may not be obvious from the written text whether or not they occurred. In such cases, both alternatives are hypothesized.

Thus there are two distinct ideas here:

  1. Treating language as a hidden process:

    Hidden Markov modeling is central to acoustic modeling in this decade. But the staple language model, the Ngram, models language as a non-hidden Markov chain. Occasionally the need for modeling a hidden linguistic event arises, such as when segment boundaries are unknown and have to be hypothesized, or when an utterance such as 'YOU KNOW' could be either a discourse marker or a normal sentential element.
    Viewing language as a hidden (typically Markovian) process has been done in POS tagging, in sense disambiguation, and occasionally in class-based language models. Here it will be used in the context of a word-based Ngram, to model ambiguous events.

  2. Case-Based modeling externally to an Ngram:

    Word-based Ngram modeling is generally hard to beat across the board. But there may be specific linguistic phenomena which can be better predicted by consulting other sources of information. Thus it would be useful to have a mechanism in place whereby specific linguistic events can be modeled using arbitrary features of the history (or even of the future) while the rest of the events in the vocabulary continue to be predicted by an Ngram. The specific features to be used for each of the special events can be determined offline by analyzing the correlation between the event and a set of candidate predictors.

In a similar way, the history of a partial hypothesis can be manipulated such that the conventional Ngram considers not the 'nominal' context (the last N-1 words) but rather some other, 'effective' context (perhaps skipping some words). Again, this is done on a case-by-case basis, only in situations where it is deemed preferable.
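
As a concrete illustration, here is a minimal C sketch of such effective-context selection. The Event type and the skip-filled-pauses policy are hypothetical illustrations, not the actual implementation:

    /* A hypothetical annotated-history entry; illustration only. */
    typedef struct {
        const char *word;
        int is_filled_pause;     /* e.g. 'UH' annotated as a filled pause */
    } Event;

    #define CTX_LEN 2            /* trigram: N-1 = 2 context words */

    /* Fill ctx[] with the effective context for the Ngram, walking
     * backwards through the annotated history and (optionally)
     * skipping filled pauses.  ctx[0] is the most recent context
     * word; the number of words found is returned. */
    int effective_context(const Event *hist, int len,
                          const char **ctx, int skip_filled_pauses)
    {
        int i, n = 0;
        for (i = len - 1; i >= 0 && n < CTX_LEN; i--) {
            if (skip_filled_pauses && hist[i].is_filled_pause)
                continue;        /* e.g. predict P(E | C D), not P(E | D UH) */
            ctx[n++] = hist[i].word;
        }
        return n;
    }

With skip_filled_pauses set, the sentence 'A B C D UH E F' of example 1 below would yield the effective context 'C D' when predicting 'E'.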


Examples

  1. To model the filled pause 'UH' as an extra-Ngram event, its occurrences can be studied offline and a good set of predictors chosen based on the history. Then, in a sentence of the form:

    'A B C D UH E F'

    UH would be predicted using:

      P(UH | A B C D)  (generally _not_ an Ngram probability)

    Furthermore, when predicting the events following 'UH', it may or may not be beneficial to retain 'UH' in their Ngram context. The decision whether or not to do so can be made offline, by analyzing the expected change in entropy. If it is determined that it is better to remove 'UH' from the context, 'E' will be predicted using:

      P(E | C D)	(a trigram probability)

  2. To model single-word repetitions as extra-Ngram events, their occurrences must again be studied offline (using disfluency-annotated data), and predicted based on some arbitrary features of the history. Then, in a sentence of the form:

    'A B C C D E',

    the second 'C' could be annotated as a single-word-repetition and predicted using:

      P(single-word-repetition | A B C)  (generally _not_ an Ngram probability)

    or it could be annotated as a non-disfluent, grammatical word, in which case it will be predicted using:

      P(C | B C)	(a trigram probability)

    Both annotations must be entertained; thus the second occurrence of 'C' represents one of two 'hidden' events. (A sketch of black boxes for both examples follows.)
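
For concreteness, here is a minimal C sketch of two such black boxes, one per example above. The signatures, features, and constants are made-up stand-ins for the offline-trained, case-based estimates, not the actual models:

    #include <string.h>

    /* Hypothetical predictor for the filled pause 'UH' (example 1).
     * The single position feature and the constants are invented. */
    double p_filled_pause(const char **hist, int len)
    {
        (void)hist;
        return (len < 2) ? 0.05 : 0.01;   /* stand-in estimates */
    }

    /* Hypothetical predictor for a single-word repetition (example 2).
     * The event is consistent only when the current word repeats its
     * predecessor, as in the second 'C' of 'A B C C D E'. */
    double p_single_word_repetition(const char **hist, int len,
                                    const char *word)
    {
        if (len == 0 || strcmp(word, hist[len - 1]) != 0)
            return 0.0;                   /* inconsistent: zero probability */
        return 0.005;                     /* stand-in trained estimate */
    }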


The Algorithm

The algorithm implementing both ideas above uses a straightforward Dynamic Programming technique (a minimal code sketch of one step of the loop follows the numbered steps):

  1. The sentence is processed left-to-right (although generalization to island-driven processing should be straightforward).
  2. While at position L in the sentence, a table is kept of annotated partial hypotheses up to position L-1. For each of these annotations:
  3. The probability of each extra-Ngram event at position L is determined by invoking a 'black-box' function which has access to the entire (annotated) history, and which implements the case-based modeling of that event.
  4. The actual word at position L is examined, and for each extra-Ngram event that is consistent with it, a new annotated hypothesis is created which extends the current hypothesis by appending that event. The 'effective context' of the new hypothesis is also determined by the appropriate black box.
    (Typically, there would be 0--3 events consistent with the actual word. For example, if the current word is 'WELL' and the history ended in 'WELL', then the current word is consistent with 'DM_WELL' (the discourse-marker), 'REP1' (a single-word repetition) and 'WELL' (the non-disfluent, grammatical word, modeled by the conventional Ngram).)
  5. After considering the probabilities of all the extra-Ngram events (including those that are _not_ consistent with the current word), the remaining probability mass is allocated to the Ngram. Thus the Ngram predictions are scaled down to fit into that mass.
    Note: since the different 'black-boxes' may use different features of the history, their predictions may not be mutually consistent. In particular, the remaining probability mass may not be positive. This is not likely to happen in practice, because the extra-Ngram events typically have very low probability. Still, this extreme condition must be checked for at runtime.
  6. If the effective context of any of the new hypotheses is the same as that of an existing hypothesis, the two are merged: their probabilities are summed together, and the more likely annotation is retained.
  7. After processing the entire sentence, the algorithm reports the combined score of all the annotations and the most likely annotation.
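
A minimal C sketch of one step of this loop (steps 3-5, with step 6 noted). The types are hypothetical, and the stubs stand in for the black boxes and for the Ngram probabilities, which in the actual implementation come from the CMU SLM Toolkit:

    #include <string.h>

    #define MAX_HYPS   64
    #define CTX_LEN    2            /* trigram: N-1 = 2 context words */
    #define NUM_EVENTS 2            /* e.g. a filled pause and REP1 */

    typedef struct {
        const char *ctx[CTX_LEN];   /* effective context, most recent first */
        double      prob;           /* probability of this partial annotation */
    } Hyp;

    /* Stubs standing in for the SLM Toolkit Ngram and the black boxes. */
    static double ngram_prob(const char *w, const char **ctx)
    { (void)w; (void)ctx; return 0.001; }
    static double event_prob(int e, const Hyp *h, const char *w)
    { (void)e; (void)h; (void)w; return 0.01; }
    static int event_consistent(int e, const Hyp *h, const char *w)
    { (void)e; (void)h; (void)w; return 0; }

    /* Extend every annotated hypothesis at position L-1 with the actual
     * word at position L; returns the number of new hypotheses in out[]. */
    static int extend(const Hyp *hyps, int nhyps, const char *word, Hyp *out)
    {
        int n = 0, i, e;
        for (i = 0; i < nhyps; i++) {
            /* Step 5: mass left for the Ngram after subtracting _all_
             * extra-Ngram events, consistent with the word or not. */
            double mass = 1.0;
            for (e = 0; e < NUM_EVENTS; e++)
                mass -= event_prob(e, &hyps[i], word);
            if (mass < 0.0)
                mass = 0.0;         /* the runtime check noted in step 5 */

            /* Step 4: one new hypothesis per consistent event. */
            for (e = 0; e < NUM_EVENTS; e++)
                if (event_consistent(e, &hyps[i], word) && n < MAX_HYPS) {
                    out[n] = hyps[i];
                    out[n].prob *= event_prob(e, &hyps[i], word);
                    /* ...the event's black box would also set the new
                     * effective context here... */
                    n++;
                }

            /* The plain-Ngram reading gets the remaining mass, and the
             * word is shifted into its effective context. */
            if (n < MAX_HYPS) {
                out[n] = hyps[i];
                out[n].prob *= mass * ngram_prob(word, out[n].ctx);
                memmove(out[n].ctx + 1, out[n].ctx,
                        (CTX_LEN - 1) * sizeof out[n].ctx[0]);
                out[n].ctx[0] = word;
                n++;
            }
        }
        /* Step 6 (merging hypotheses with identical effective contexts
         * by summing their probabilities) is omitted from this sketch. */
        return n;
    }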


Implementation

The algorithm as described above was implemented in C, using invocations of various functions in the CMU SLM Toolkit.

Analysis of predictors and generation of the associated black boxes was performed by Rukmini, who experimented with several modeling options for a specific subset of disfluencies.


Conclusions

We now believe that many types of disfluencies will _not_ benefit directly from extra-Ngram modeling. This conclusion is based on:

  1. The preliminary implementation of some types of disfluencies as extra-Ngram events in this work, performed by Rukmini.
  2. Andreas and Liz's attempt to model other types of disfluencies as trigram events, and their subsequent analysis.
  3. Rajeev's sentence-based error analysis and Bill's word-based error analysis, which showed that recognition errors do not appear correlated with disfluencies. Many disfluencies are recognized fairly well, perhaps because they are acoustically distinct, and perhaps also because they are well trained linguistically: the most common types of disfluencies are as common as the most common words in the vocabulary.

However, goal #2 of this work (disfluency cleaning) is still valid and worth pursuing. Perhaps more importantly, this newly created mechanism is being used for "inverse bleaching" (see goal #3) of all phenomena peculiar to conversational speech.