Audio-Visual Speech Recognition

It is well known that humans have the ability to lip-read: we combine audio and visual information in deciding what has been spoken, especially in noisy environments. A dramatic example is the so-called McGurk effect, where the spoken sound /ba/ is superimposed on the video of a person uttering /ga/. Most people perceive the speaker as uttering the sound /da/.

We will strive to achieve automatic lip-reading by computers, i.e., to use video of the speaker's face to make computers recognize human speech better than is now possible from the audio input alone. Many difficult research problems stand in the way of this task, e.g., tracking the speaker's head as she moves in the video frame, identifying the type of lip movement, guessing the spoken words independently from the video and the audio, and combining the information from the two signals to make a better guess of what was spoken. In the summer, we will focus on one specific problem: how best to combine the information from the audio and video signals.

For example, using visual cues to decide whether a person said /ba/ rather than /ga/ can be easier than deciding from audio cues, which can sometimes be confusing. On the other hand, deciding between /ka/ and /ga/ is more reliably done from the audio than from the video. Our confidence in the audio-based and video-based hypotheses therefore depends on the kinds of sounds being confused. We will invent and test algorithms for combining the automatic speech classification decisions based on the audio and visual stimuli, with the goal of audio-visual speech recognition that improves significantly on traditional audio-only speech recognition performance.
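
To make the idea of combining the two decisions concrete, here is a minimal sketch of one common late-fusion scheme: a weighted log-linear combination of per-class probabilities from an audio classifier and a video classifier. It is an illustration only, not the algorithm the team will develop. The function name fuse_posteriors, the audio_weight parameter, and all probability values are hypothetical and chosen purely for the example.

```python
import math


def fuse_posteriors(audio, video, audio_weight):
    """Combine audio and video class probabilities log-linearly.

    audio, video: dicts mapping a sound label to a probability.
    audio_weight: trust placed in the audio stream, between 0 and 1;
                  the video stream receives 1 - audio_weight.
    Returns a dict of fused probabilities that sums to 1.
    All names and values here are illustrative assumptions.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    labels = audio.keys() & video.keys()
    if not labels:
        raise ValueError("the two streams share no labels")
    eps = 1e-12  # guards against log(0)
    scores = {
        label: audio_weight * math.log(audio[label] + eps)
        + (1.0 - audio_weight) * math.log(video[label] + eps)
        for label in labels
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    unnormalized = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


# Made-up case 1: the audio cannot separate /ba/ from /ga/, the video can.
audio_1 = {"ba": 0.45, "ga": 0.45, "ka": 0.10}
video_1 = {"ba": 0.90, "ga": 0.05, "ka": 0.05}
fused_1 = fuse_posteriors(audio_1, video_1, audio_weight=0.5)

# Made-up case 2: /ka/ and /ga/ look alike on the lips, so trust the audio more.
audio_2 = {"ba": 0.05, "ga": 0.15, "ka": 0.80}
video_2 = {"ba": 0.04, "ga": 0.50, "ka": 0.46}
fused_2 = fuse_posteriors(audio_2, video_2, audio_weight=0.8)

print(max(fused_1, key=fused_1.get), max(fused_2, key=fused_2.get))
```

In this sketch a single fixed weight is supplied by hand for each case. Since the reliability of each stream depends on which sounds are being confused, a practical system would need to choose or learn that weight from the data, which is the kind of question the project sets out to study.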

 

Team Members
Senior Members
Andreas Andreou, CLSP
Juergen Luettin, IDIAP
Iain Matthews, HCII, CMU
Chalapathy Neti, IBM
Gerasimos Potamianos, IBM

Graduate Students
Herve Glotin, ICP-Grenoble, France

Undergraduate Students
Azad Mashari, University of Toronto
June Sison, UC Santa Cruz

Center for Language and Speech Processing