Previous research on large-vocabulary automatic speech recognition (ASR) has mainly concentrated on European and Asian languages. Other language groups have been explored to a lesser extent, for instance Semitic languages like Hebrew and Arabic. These languages possess certain characteristics which present problems for standard ASR systems. For example, their written representation does not contain most of the vowels present in the spoken form, which makes it difficult to utilize textual training data. Furthermore, they have a complex morphological structure, which is characterized not only by a high degree of affixation but also by the interleaving of vowel and consonant patterns (so-called “non-concatenative morphology”). This leads to a large number of possible word forms, which complicates the robust estimation of statistical language models.
In this workshop group we aim to develop new modeling approaches to address these and related problems, and to apply them to the task of conversational Arabic speech recognition. We will develop and evaluate a multi-linear language model, which decomposes the task of predicting a given word form into predicting more basic morphological patterns and roots. Such a language model can be combined with a similarly decomposed acoustic model, which necessitates new decoding techniques based on modeling statistical dependencies between loosely coupled information streams. Since one pervading issue in language processing is the tradeoff between language-specific and language-independent methods, we will also pursue an alternative control approach which relies on the capabilities of existing, language-independent recognition technology. Under this approach no mophological analysis will be performed and all word forms will be treated as basic vocabulary units. Furthermore, acoustic model topologies will be used which specify short vowels as optional rather than obligatory elements, in order to facilitate the use of text documents as language model training data. Finally, we will investigate the possibility of using large, generally available text and audio sources to improve the accuracy of conversational Arabic speech recognition.
|Jeff Bilmes||University of Washington|
|Katrin Kirchhoff||University of Washington|
|Rich Schwartz||BBN Technologies|
|Gang Ji||University of Washington|
|Mohamed Noamany||BBN Technologies|
|Melissa Egan||Pomona College|
|Feng He||Swarthmore College|