Previous research on large-vocabulary automatic speech recognition (ASR)
has mainly concentrated on European and Asian languages. Other language
groups have been explored to a lesser extent, for instance Semitic
languages like Hebrew and Arabic. These languages possess certain
characteristics which present problems for standard ASR systems. For
example, their written representation does not contain most of the vowels
present in the spoken form, which makes it difficult to utilize textual
training data. Furthermore, they have a complex morphological structure,
which is characterized not only by a high degree of affixation but also by
the interleaving of vowel and consonant patterns (so-called
"non-concatenative morphology"). This leads to a large number of possible
word forms, which complicates the robust estimation of statistical
language models.
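
The interleaving of consonantal roots with vowel patterns described above can be illustrated with a minimal Python sketch. The simplified transliterations, the digit-slot notation for patterns, and the names `interleave` and `surface_forms` are assumptions made purely for illustration. The free combination of proclitics, patterns and roots also overgenerates relative to real Arabic.

```python
from itertools import product

def interleave(root, pattern):
    """Fill the digit slots 1..n of `pattern` with the consonants of `root`.

    Illustrative notation only: "1a2a3a" means consonant 1, "a",
    consonant 2, "a", consonant 3, "a".
    """
    return "".join(root[int(ch) - 1] if ch.isdigit() else ch for ch in pattern)

def surface_forms(roots, patterns, proclitics):
    """Combine every proclitic, root and pattern into a surface word form."""
    return {
        clitic + interleave(root, pattern)
        for clitic, root, pattern in product(proclitics, roots, patterns)
    }

# Toy inventory (simplified transliteration, assumed for this sketch).
ROOTS = ["ktb", "drs"]                      # "write", "study"
PATTERNS = ["1a2a3a", "1aa2i3", "ma12uu3"]  # e.g. kataba, kaatib, maktuub
PROCLITICS = ["", "wa", "fa"]               # none, "and", "so"

FORMS = surface_forms(ROOTS, PATTERNS, PROCLITICS)
```

Even this toy inventory of two roots, three patterns and three proclitic options produces 18 distinct surface forms, which hints at why word-level counts become sparse when every form is treated as a separate vocabulary item.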
In this workshop group, we aim to develop new modeling approaches to
address these and related problems, and to apply them to the task of
conversational Arabic speech recognition. We will develop and evaluate a
multi-linear language model, which decomposes the task of predicting a
given word form into predicting more basic morphological patterns and
roots. Such a language model can be combined with a similarly decomposed
acoustic model, which necessitates new decoding techniques based on
modeling statistical dependencies between loosely coupled information
streams. Since one pervasive issue in language processing is the tradeoff
between language-specific and language-independent methods, we will also
pursue an alternative control approach which relies on the capabilities of
existing, language-independent recognition technology. Under this
approach no morphological analysis will be performed, and all word forms
will be treated as basic vocabulary units. Furthermore, acoustic model
topologies will be used which specify short vowels as optional rather than
obligatory elements, in order to facilitate the use of text documents as
language model training data. Finally, we will investigate the
possibility of using large, generally available text and audio sources to
improve the accuracy of conversational Arabic speech recognition.
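
The idea of specifying short vowels as optional elements can be sketched as an enumeration of the phone sequences that such a topology would accept for an unvowelled written form. This is only an illustration under stated assumptions: the function name `optional_vowel_paths`, the three-vowel inventory and the one-symbol-per-consonant transliteration are hypothetical, and a real acoustic model would encode the optional vowels as skippable states rather than as an explicit list.

```python
from itertools import product

# Assumed short-vowel inventory in simplified transliteration.
SHORT_VOWELS = ("a", "i", "u")

def optional_vowel_paths(consonants, vowels=SHORT_VOWELS):
    """Return every phone string in which each consonant is followed by
    either no vowel or exactly one short vowel."""
    # One slot per consonant: the empty string means the vowel is skipped.
    slots = [("",) + tuple(vowels)] * len(consonants)
    return {
        "".join(c + v for c, v in zip(consonants, choice))
        for choice in product(*slots)
    }

PATHS = optional_vowel_paths("ktb")
```

For the three-consonant form "ktb" this gives 4 x 4 x 4 = 64 candidate sequences, including the fully unvowelled "ktb" and the fully vowelled "kataba", so the same unvowelled text token can be aligned with any of its spoken realizations.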