It is well known that humans have the ability to lip-read: we combine audio and visual information in deciding what has been spoken, especially in noisy environments. A dramatic example is the so-called McGurk effect, in which the audio of a spoken sound /ga/ is superimposed on the video of a person uttering /ba/; most people perceive the speaker as uttering the sound /da/.
We will strive to achieve automatic lip-reading by computers, i.e., to make computers recognize human speech even better than is now possible from the audio input alone, by also using video of the speaker's face. There are many difficult research problems on the way to succeeding in this task, e.g., tracking the speaker's head as she moves in the video frame, identifying the type of lip movement, guessing the spoken words independently from the video and from the audio, and combining the information from the two signals to make a better guess of what was spoken. This summer, we will focus on one specific problem: how best to combine the information from the audio and video signals.
For example, using visual cues to decide whether a person said /ba/ rather than /ga/ can be easier than making the decision from audio cues, which can sometimes be confusable. On the other hand, deciding between /ka/ and /ga/ is more reliably done from the audio than from the video. Our confidence in the audio-based and video-based hypotheses therefore depends on which sounds are being confused. We will invent and test algorithms for combining the automatic speech classification decisions based on the audio and visual stimuli, resulting in audio-visual speech recognition that significantly improves on traditional audio-only speech recognition performance.
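One simple family of such combination algorithms is log-linear fusion of the two classifiers' posterior probabilities, where a reliability weight controls how much each modality is trusted. The sketch below is illustrative only, not the project's actual method; the function name, the fixed weight, and the toy probabilities are all assumptions.

```python
# Illustrative sketch of audio-visual decision fusion (not the project's
# actual algorithm): combine per-class posteriors from an audio classifier
# and a video classifier via a weighted geometric mean, then renormalize.

def fuse_posteriors(audio_probs, video_probs, audio_weight=0.7):
    """Log-linear fusion: P(c) proportional to P_audio(c)^w * P_video(c)^(1-w).

    audio_weight near 1 trusts the audio classifier; near 0 trusts the video.
    In practice the weight could itself depend on which sounds are confusable,
    as the text suggests (e.g., favor video for /ba/ vs. /ga/).
    """
    fused = {
        c: (audio_probs[c] ** audio_weight) * (video_probs[c] ** (1.0 - audio_weight))
        for c in audio_probs
    }
    total = sum(fused.values())
    return {c: p / total for c, p in fused.items()}

# Toy example: for /ka/ vs. /ga/ the audio is discriminative but the
# video is not, so the fused decision follows the audio evidence.
audio = {"ka": 0.8, "ga": 0.2}
video = {"ka": 0.5, "ga": 0.5}
fused = fuse_posteriors(audio, video, audio_weight=0.7)
```

A real system would estimate the weight from data (e.g., per phoneme class or per noise condition) rather than fixing it by hand.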
Team:
Iain Matthews, HCII, CMU
Herve Glotin, ICP-Grenoble, France
Azad Mashari, University of Toronto
June Sison, UC Santa Cruz