Prevailing approaches to automatic speech recognition (hidden Markov models, finite-state transducers) are typically based on the assumption that a word can be represented as a single sequence of phonetic states. However, the production of a word involves the simultaneous motion of several articulators, such as the lips and tongue, which may move asynchronously and may not always reach their target positions. This may be more naturally and parsimoniously modeled using multiple streams of hidden states, each corresponding to an articulatory feature (AF). Recent theories of phonology support this idea, representing words using multiple streams of sub-phonetic features, which may be either directly related to the articulators or more abstract (e.g. manner and place). In addition, factoring the observation model of a recognizer into multiple factors, each corresponding to a different AF, may allow for savings in training data. Finally, such an approach can be naturally applied to audio-visual speech recognition, in which the asynchrony between articulators is particularly striking; and multilingual speech recognition, which may leverage the universality of some AFs across languages.
This project will explore the large space of possible AF-based models for automatic speech recognition, on both audio-only and audio-visual tasks. While a good deal of previous work has investigated various components of such a recognizer, such as AF classifiers and AF-based pronunciation models, little effort has gone into building complete, fully AF-based recognizers. Our models will be represented as dynamic Bayesian networks. This is a natural framework for modeling processes with inherent factorization of the state space, and allows for investigation of a large variety of models using universal training and decoding algorithms.
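To make the factorization concrete, here is a minimal sketch (our own illustration, not the project's actual model) of a two-stream articulatory-feature DBN in which each stream has its own transition matrix and observation factor. Because both the transition model and the observation model factor across streams, the forward recursion decomposes into one independent pass per stream; all names and parameter values below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 2 AF streams, 3 hidden states per stream, 5 observation frames.
n_streams, n_states, T = 2, 3, 5
A = rng.dirichlet(np.ones(n_states), size=(n_streams, n_states))  # A[s, i, j]: per-stream transitions
pi = rng.dirichlet(np.ones(n_states), size=n_streams)             # per-stream initial distributions
# B[s, i, t]: likelihood of frame t under state i of stream s (synthetic here).
B = rng.random((n_streams, n_states, T)) + 1e-3

def forward_loglik(A, pi, B):
    """Log-likelihood of the frames under the fully factored DBN.

    Since transitions and observation factors are independent across
    streams, the joint forward pass reduces to a sum of per-stream
    forward-algorithm log-likelihoods.
    """
    total = 0.0
    for s in range(len(pi)):
        alpha = pi[s] * B[s, :, 0]          # initialize forward variable
        for t in range(1, B.shape[2]):
            alpha = (alpha @ A[s]) * B[s, :, t]  # propagate and weight by evidence
        total += np.log(alpha.sum())
    return total

print(forward_loglik(A, pi, B))
```

Allowing the streams to desynchronize within a word would add coupling (e.g. soft asynchrony penalties) between the stream variables; toolkits for general DBNs support such structures with the same training and decoding machinery, which is the attraction noted above.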
| Name | Affiliation |
| --- | --- |
| Simon King | University of Edinburgh |
| Chris Bartels | University of Washington |
| Partha Lal | University of Edinburgh |
| Lisa Yung | Johns Hopkins University |