Our goal is to model the extensive pronunciation variation found in the Switchboard corpus, which is likely an important factor in the difficulty current ASR systems have with this conversational speech task. In contrast to previous efforts, we will use the recently created ICSI hand-labeled phonetic transcriptions of Switchboard as the target data of our modeling. This new corpus potentially contains a wealth of information about pronunciation in conversational speech. As source data, we will use relevant phonological, prosodic, syntactic, and discourse information, including baseform pronunciations of words, lexical stress, pitch accent, and segmental durations. We will map from source to target using various stochastic and rule-based methods, including statistical decision trees, rewrite rules, and MMI. The initial measure of performance will be the reduction in the conditional entropy of the target ICSI transcriptions given the source linguistic information. Next, these mappings will be used in a speech recognizer to generate alternative pronunciations in context, and word error rate will be measured. As time permits, the pronunciation models created above will be used to automatically transcribe a portion of the speech corpus, and the acoustic models will then be re-estimated based on these transcriptions. We will also explore generating constrained automatic alignments of all of the data as an alternative to the ICSI data.
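To make the proposed entropy measure concrete, the following is a minimal sketch of how a reduction in conditional entropy could be estimated from paired source and target labels. It is an illustration under stated assumptions, not the project's actual tooling: the function name `conditional_entropy`, the plug-in (relative-frequency) estimator, and the toy observations are all hypothetical, and the phone labels are invented rather than drawn from the ICSI transcriptions.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Plug-in estimate of H(target | source) in bits.

    `pairs` is an iterable of (source, target) observations, where the
    source is any hashable description of the linguistic context and the
    target is the observed surface phone.
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one observation")
    joint = Counter(pairs)                  # counts of (source, target)
    source = Counter(s for s, _ in pairs)   # counts of source alone
    h = 0.0
    for (s, _t), c in joint.items():
        # p(s, t) * log2 p(t | s), using relative frequencies
        h -= (c / n) * math.log2(c / source[s])
    return h


# Invented toy data: a baseform phone with a stress feature as the source,
# and a surface realization ("del" marks a deletion) as the target.
toy = [
    (("t", "stressed"), "t"),
    (("t", "stressed"), "t"),
    (("t", "unstressed"), "dx"),
    (("t", "unstressed"), "dx"),
    (("t", "unstressed"), "t"),
    (("t", "unstressed"), "del"),
]

# Source = baseform phone only.
baseline = conditional_entropy((s[0], t) for s, t in toy)
# Source = baseform phone plus lexical stress.
with_stress = conditional_entropy(toy)
# Bits of uncertainty removed by adding the stress feature.
reduction = baseline - with_stress
```

In this toy setting, conditioning on stress in addition to the baseform phone lowers the estimated entropy, which is the kind of reduction the measure is meant to capture. A relative-frequency estimate computed on the same data used to build the model is optimistic, so a real evaluation would presumably score held-out transcriptions.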
|Michael Riley||AT&T Labs|