This work grew out of a desire to model disfluencies in conversational speech, but was generalized to accommodate the modeling of other phenomena, most notably other aspects of conversational speech such as hedges, flavor words, etc. The current goals of this work are:
These goals will be pursued by developing a general mechanism for case-based modeling of linguistic events within an otherwise Ngram framework. The events thus modeled may be hidden: it may not be obvious from the written text whether or not they occurred. In such cases, both alternatives are hypothesized.
Thus there are two distinct ideas here:
Hidden Markov modeling is central to acoustic modeling in this decade. But the staple language model, the Ngram, models language as a non-hidden Markov chain. Occasionally the need to model a hidden linguistic event arises, such as when segment boundaries are unknown and must be hypothesized, or when an utterance such as 'YOU KNOW' could be either a discourse marker or a normal sentential constituent.
Viewing language as a hidden (typically Markovian) process has been done in POS tagging, in sense disambiguation, and occasionally in class-based language models. Here it will be used in the context of a word-based Ngram, to model ambiguous events.
Word-based Ngram modeling is generally hard to beat across the board. But there may be specific linguistic phenomena that can be better predicted by consulting other sources of information. It would therefore be useful to have a mechanism in place whereby specific linguistic events can be modeled using arbitrary features of the history (or even of the future), while the rest of the events in the vocabulary continue to be predicted by an Ngram. The specific features to be used for each of the special events can be determined offline, by analyzing the correlation between the event and a set of candidate predictors.
In a similar way, the history of a partial hypothesis can be manipulated so that the conventional Ngram considers not the 'nominal' context (the last N-1 words) but rather some other, 'effective' context (perhaps skipping some words). Again, this is done on a case-by-case basis, only in situations where it is deemed preferable.
UH would be predicted using:
P(UH |A B C D) (generally _not_ an Ngram probability)
Furthermore, when predicting the events following 'UH', it may or may not be beneficial to retain 'UH' in their Ngram context. The decision whether or not to do so can be made offline, by analyzing the expected change in entropy. If it is determined that it is better to remove 'UH' from the context, 'E' will be predicted using:
P(E | C D) (a trigram probability)
the second 'C' could be annotated as a single-word-repetition and predicted using:
P(single-word-repetition|A B C) (generally _not_ an Ngram probability)
or it could be annotated as a non-disfluent, grammatical word, in which case it will be predicted using:
P(C | B C) (a trigram probability)
Both annotations must be entertained. Thus the second occurrence of 'C' represents either one of two 'hidden' events.
The algorithm implementing both ideas above uses a straightforward Dynamic Programming technique:
The algorithm as described above was implemented in C, using invocations of various functions in the CMU SLM Toolkit.
Analysis of predictors and generation of the associated black boxes was performed by Rukmini, who experimented with several modeling options for a specific subset of disfluencies.
We now believe that many types of disfluencies will _not_ benefit directly from extra-ngram modeling. This conclusion is based on:
However, goal #2 of this work (disfluency cleaning) is still valid and worth pursuing. Perhaps more importantly, the newly created mechanism is being used for "inverse bleaching" (see goal #3) of all phenomena peculiar to conversational speech.