The Center for Language and Speech Processing




Workshops

Multilingual Spoken Term Detection: Finding and Testing New Pronunciations

When you listen to the evening news, or read a newspaper, book or web site, there is a good chance that you will hear or see a term -- perhaps a name, perhaps a technical term -- that you have never seen before. Such words are often novel or rare and are often names (of people, places, organizations...). They are hard for humans to process, but they are even harder for automatic speech and language processing systems.

For a single language, a speech recognition or text-to-speech system needs to know how a word is pronounced in order to recognize or say it. For two languages, in particular a pair with different writing systems, a search engine or document summarizer needs to know how a word in one script is transliterated into the other in order to retrieve or distill information across languages. For example, the soccer player written in English text as Maciej Zurawski would appear as "machiei julapeuseuki" in Korean (simplified here using the Latin rather than Korean script).

In this project we will attack both problems -- unusual term pronunciation and term transliteration -- in a combined research effort. For pronunciation, we will mine the large number of pronunciations that are now available in various forms on the web. These sources range from the straightforward, such as dictionary sites and Wikipedia entries where people use a fairly strict phonetic transcription system such as IPA, to the difficult, such as running text in which a writer gives a name like "Capecchi", followed by the word "pronounced" and an ad-hoc respelling like "kuh-PEK'-ee".

Here we need to look in the vicinity of the name "Capecchi" to find the pronunciation, make use of the word "pronounced", and then interpret the writer's attempt to render the pronunciation using an English-based ad-hoc "phonetic" orthography. The problem is therefore one of entity extraction, where the entities to extract can be either relatively easy or relatively hard. A relatively easy case is Wikipedia, which uses standard IPA transcriptions that are clearly delimited by markup. On general web pages, tokens with Unicode IPA characters are potential pronunciations. Data extracted from Wikipedia can be matched against these tokens to provide training material for entity extraction. Statistical entity extractors for the more difficult case of ad-hoc phonetic transcriptions (such as "kuh-PEK'-ee" above) can be bootstrapped from unannotated web pages containing patterns such as "pronounced as". These entity extractors will make use of both the textual environment and the letter-to-sound constraints between the candidate pronunciation and its corresponding orthography.
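The two extraction cases described above can be sketched roughly as follows. This is a minimal illustration, not the project's extractor: the cue pattern, the function names, the invented sample sentence, and the choice of Unicode ranges (IPA Extensions plus Spacing Modifier Letters, U+0250 to U+02FF) are all assumptions made for the example. A statistical extractor would replace the hand-written pattern and would also score letter-to-sound agreement between the name and the candidate.

```python
import re

# Hypothetical cue pattern: a capitalized name, the word "pronounced"
# (optionally "pronounced as"), then a hyphenated ad-hoc respelling.
CUE = re.compile(
    r"(?P<name>[A-Z][a-z]+)\s*\(?\s*pronounced(?:\s+as)?\s+[\"“]?"
    r"(?P<pron>[A-Za-z']+(?:-[A-Za-z']+)+)"
)

def find_respellings(text):
    """Return (name, respelling) pairs found next to a 'pronounced' cue."""
    return [(m.group("name"), m.group("pron")) for m in CUE.finditer(text)]

def ipa_candidates(text):
    """Return tokens containing at least one character in U+0250..U+02FF.

    This range covers the IPA Extensions and Spacing Modifier Letters
    blocks; treating any such token as a candidate is an assumption.
    """
    tokens = (t.strip("/[](),.;:") for t in text.split())
    return [t for t in tokens if any(0x0250 <= ord(c) <= 0x02FF for c in t)]

# Invented sample sentence built around the respelling quoted in the text.
sample = "Mario Capecchi (pronounced kuh-PEK'-ee) shared the prize."
pairs = find_respellings(sample)

# Illustrative IPA token (a common British English transcription of "water").
ipa = ipa_candidates("water /ˈwɔːtə/ is a common word")
```

Pairs harvested this way from unannotated pages could serve as seed examples for bootstrapping, while the IPA tokens could be matched against Wikipedia-derived data as the paragraph describes.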

We will also use speech data to test possible pronunciation variants by comparing the performance of spoken term detection systems using these different variants. Pronunciations mined from the web will be used to suggest pronunciations for spoken term detection; transliteration will be used to suggest reasonable candidates to search for in a speech stream in another language. We will use a novel technique called delayed-decision testing to test candidate pronunciations in speech, and to choose the best one from a set of candidates via a sequential testing procedure, with the associated null hypothesis stating that all candidate pronunciations will exhibit the same performance on average. Spoken term detection will in turn be used for automatic labeling of practice data acquired to test this null hypothesis; however, this automatic labeling procedure will inevitably induce false alarms as well as correct detections. Delayed-decision testing will then be used to choose the correct pronunciation in spite of these false alarms, leading to improved pronunciations for newly identified terms.
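The text does not spell out how delayed-decision testing works, so the sketch below is only a generic stand-in for the idea of sequentially comparing candidates against a null hypothesis of equal performance. It uses a plain two-proportion z-test on hypothetical 0/1 detection outcomes; the function names, the threshold, and the minimum trial count are assumptions. Testing repeatedly after every trial inflates the error rate, which a proper sequential procedure would correct for.

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """z statistic for H0: both candidates have the same hit rate."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    var = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if var == 0:
        return 0.0
    return (hits_a / n_a - hits_b / n_b) / math.sqrt(var)

def pick_pronunciation(outcomes, threshold=3.0, min_trials=10):
    """Choose among candidate pronunciations as trials accumulate.

    outcomes maps each candidate to a list of 0/1 scores in arrival
    order (1 = detection judged correct, 0 = judged a false alarm).
    At least two candidates are expected.
    Returns (best_candidate, trials_used, decided).
    """
    names = list(outcomes)
    total = min(len(v) for v in outcomes.values())
    for n in range(min_trials, total + 1):
        hits = {c: sum(outcomes[c][:n]) for c in names}
        ranked = sorted(names, key=lambda c: hits[c], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        # Stop as soon as the leader is clearly ahead of the runner-up.
        if two_proportion_z(hits[best], n, hits[runner_up], n) >= threshold:
            return best, n, True
    # Data exhausted without rejecting H0: report the current leader.
    hits = {c: sum(outcomes[c][:total]) for c in names}
    return max(names, key=lambda c: hits[c]), total, False

# Hypothetical outcome streams for two candidate pronunciations.
clear = pick_pronunciation({"a": [1] * 20, "b": [0] * 20})
tied = pick_pronunciation({"a": [1, 0] * 10, "b": [1, 0] * 10})
```

In the first call the leader is separated after the minimum number of trials; in the second the null hypothesis is never rejected and the function reports that no decision was reached.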

For transliteration, we will use available resources -- dictionaries and text corpora -- as well as methods for phonetic matching across scripts and for tracking names across time in comparable corpora (such as news sources). In previous work at UIUC, JHU and many other sites, researchers have investigated phonetic transliteration models trained from lexicons. More recently, we have developed techniques to guess transliteration equivalents using pronunciation estimates for the English term, pronunciation guesses for the foreign term, and phonetic distances based on standard phonetic features as well as "pseudofeatures" derived from phonetic substitutions observed in second-language learners of English. Reasonable transliteration matches can be found using hand-tuned costs based on these features, though performance improves when the weights are trained discriminatively, even on a small dictionary of transliterations. We have also investigated using time correlations of terms across comparable corpora, such as newswire text. Related terms, including transliterations of the same name, are distributed similarly in time, and this is powerful additional evidence over and above phonetic similarity. The goals for this workshop are to collect the best of these tools together, tune them, make them publicly available, and use them to develop transliterators for 20 language pairs (English-X), including at least 10 underresourced languages.
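The two kinds of evidence described above, feature-based phonetic distance and correlation over time, can be illustrated with a small sketch. Everything specific here is an assumption for the example: the tiny feature table, the Jaccard-style substitution cost, the unit insertion/deletion cost, and the made-up daily counts. The costs are hand-set; the discriminatively trained weights mentioned in the text are not modeled.

```python
import math

# Hypothetical feature table; a real system would cover a full phone set.
FEATURES = {
    "p": {"labial", "stop"},
    "b": {"labial", "stop", "voiced"},
    "l": {"alveolar", "liquid", "voiced"},
    "r": {"alveolar", "liquid", "voiced", "rhotic"},
}

def sub_cost(a, b):
    """Substitution cost: share of features the two phones do not have in common."""
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a, {a}), FEATURES.get(b, {b})
    return len(fa ^ fb) / len(fa | fb)

def phonetic_distance(x, y, indel=1.0):
    """Weighted edit distance between two phone sequences."""
    prev = [j * indel for j in range(len(y) + 1)]
    for i, a in enumerate(x, 1):
        cur = [i * indel]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + indel,               # delete a
                           cur[j - 1] + indel,            # insert b
                           prev[j - 1] + sub_cost(a, b))) # substitute
        prev = cur
    return prev[-1]

def time_correlation(counts_a, counts_b):
    """Pearson correlation of two equal-length per-day frequency series."""
    n = len(counts_a)
    mean_a, mean_b = sum(counts_a) / n, sum(counts_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(counts_a, counts_b))
    var_a = sum((a - mean_a) ** 2 for a in counts_a)
    var_b = sum((b - mean_b) ** 2 for b in counts_b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

# r and l differ in one feature out of four, so swapping them is cheap.
d_close = phonetic_distance(["r", "a"], ["l", "a"])
d_far = phonetic_distance(["p"], ["l"])
# Made-up daily counts for two terms that rise and fall together.
corr = time_correlation([0, 2, 5, 1], [0, 4, 10, 2])
```

With this table an r/l substitution, like the one in the Zurawski example earlier, costs far less than replacing a stop with a liquid, and two terms whose counts move together receive a correlation close to 1.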

The results of this workshop will include tools for finding new pronunciations on the web, tools for developing transliteration systems, concrete pronunciation dictionaries and transliterators for various languages, and improved methodologies for spoken term detection and for using speech data as evidence for word pronunciation.


Team Members

Team Leader
     Richard Sproat, rws at xoba dot com, University of Illinois
Senior Personnel
     Jim Baker, james dot karl dot baker at gmail dot com, Johns Hopkins University
     Martin Jansche, jansche at acm dot org, Google Inc
     Michael Riley, riley at google dot com, Google Inc
     Murat Saraclar, murat dot saraclar at boun dot edu dot tr, Bogazici University
     Abinav Sethy, asethy at us dot ibm dot com, IBM
     Patrick Wolfe, patrick at seas dot harvard dot edu, Harvard University
Graduate Students
     Arnab Ghoshal, ag at jhu dot edu, Johns Hopkins University
     Kristy Hollingshead, hollingk at cslu dot ogi dot edu, Oregon Health & Science University
     Christopher White, Johns Hopkins University
Undergraduate Students
     Ting Qian, ting dot qian at rochester dot edu, University of Rochester
     Erica Cooper, ecooper at mit dot edu, Massachusetts Institute of Technology
     Morgan Ulinski, meu3 at cornell dot edu, Cornell University
Affiliates
     Mona Diab, md2370 at columbia dot edu, Columbia University
     Bhuvana Ramabhadran, bhuvana at us dot ibm dot com, IBM
Students Attending Pre-Workshop Summer School
     Pavel Ceska, ceska at ufal dot mff dot cuni dot cz, Charles University, Prague