Prosody in Spoken Language Processing – Izhak Shafran (Johns Hopkins University)

November 29, 2005 all-day

Automatic speech recognition is now capable of transcribing speech from a variety of sources with high accuracy, which has opened up new opportunities and challenges in translation, summarization and distillation. Currently, most applications extract only the sequence of words from a speaker's voice and ignore other useful information that can be inferred from speech, such as prosody. Prosody has been studied extensively by linguists and is often characterized in terms of phrasing (break indices), tones and emphasis (prominence). The availability of a prosodically labeled corpus of conversational speech has spurred renewed interest in exploiting prosody for downstream applications. As a first step, an automatic method is needed to detect prosodic events. For this task, we have investigated the performance of a series of classifiers of increasing complexity, namely, decision trees, bagging-based classifiers, random forests and hidden Markov models of different orders. Our experiments show that break indices and prominence can be detected with accuracies above 80%, making them useful for practical applications.

Two such applications were explored. In the context of disfluency detection, the interaction between the prosodic interruption point and the syntactic EDITED constituents was modeled with a simple and direct model, a PCFG with additional tags. The preliminary results are promising: the F-score of the EDITED constituent improves significantly without appreciably degrading the overall F-measure. Building more elaborate generative models is difficult, largely due to the lack of an authoritative theory of the syntax-phonology interface. An alternative approach is to incorporate the interaction as features in a discriminative framework for parsing, speech recognition or metadata detection.
As an example, we illustrate how this can be done for sentence boundary detection using a re-ranking framework, and show improvements over a state-of-the-art system. The work reported in this talk was carried out at the 2005 JHU workshop and previously at the University of Washington, in collaboration with several researchers.
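To give a flavor of the prosodic event detection task, the following is a minimal sketch of classifying break indices with a random forest, one of the classifiers mentioned above. It uses scikit-learn and synthetic data; the feature set (pause duration, pitch range, energy) and the two-way break/no-break labeling are illustrative assumptions, not the talk's actual corpus or feature inventory.

```python
# Hypothetical sketch: break-index detection as binary classification.
# Features and data are synthetic; the talk's real system used a richer
# label set and features extracted from conversational speech.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic word-level features: [pause duration (s), pitch range (Hz), energy]
n = 200
X_break = rng.normal(loc=[0.40, 60.0, 0.5], scale=[0.10, 15.0, 0.1], size=(n, 3))
X_none = rng.normal(loc=[0.05, 20.0, 0.8], scale=[0.02, 10.0, 0.1], size=(n, 3))
X = np.vstack([X_break, X_none])
y = np.array([1] * n + [0] * n)  # 1 = major break, 0 = no break

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
acc = clf.score(X, y)
```

On well-separated synthetic clusters like these the accuracy is trivially high; the point is only the shape of the pipeline, with word-level prosodic features feeding an ensemble classifier.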
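The re-ranking idea for sentence boundary detection can be sketched as rescoring candidate hypotheses with a baseline score plus weighted prosodic features. The hypotheses, feature values and weight below are invented for illustration; the talk's actual feature set and learning procedure are not specified here.

```python
# Hedged sketch of re-ranking: combine a baseline model score with a
# prosodic feature (e.g. break strength at the candidate boundary) and
# pick the highest-scoring hypothesis. All numbers are made up.

# Each hypothesis maps to (baseline log-score, prosodic break-strength feature).
hypotheses = {
    "no-boundary": (-1.2, 0.1),
    "boundary": (-1.5, 0.9),
}

W_PROSODY = 0.8  # feature weight; in practice tuned on held-out data


def rerank_score(baseline, prosody, w=W_PROSODY):
    """Linear combination of baseline score and prosodic evidence."""
    return baseline + w * prosody


best = max(hypotheses, key=lambda h: rerank_score(*hypotheses[h]))
# Strong prosodic evidence can overturn the baseline's preference.
```

Here the baseline alone prefers "no-boundary" (-1.2 vs. -1.5), but the prosodic feature flips the decision, which is exactly the kind of correction a discriminative re-ranker can contribute on top of a state-of-the-art baseline.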

Center for Language and Speech Processing