Audio-Visual Speech Recognition

Research Group of the 2000 Summer Workshop

It is well known that humans have the ability to lip-read: we combine audio and visual information in deciding what has been spoken, especially in noisy environments. A dramatic example is the so-called McGurk effect, where the spoken sound /ba/ is superimposed on the video of a person uttering /ga/. Most people perceive the speaker as uttering the sound /da/.

We will strive to achieve automatic lip-reading by computers, i.e., to use video of the speaker's face to make computers recognize human speech better than is now possible from the audio input alone. Many difficult research problems stand in the way of this task: tracking the speaker's head as she moves in the video frame, identifying the type of lip movement, guessing the spoken words independently from the video and from the audio, and combining the information from the two signals to make a better guess of what was spoken. During the summer, we will focus on one specific problem: how best to combine the information from the audio and video signals.

For example, using visual cues to decide whether a person said /ba/ rather than /ga/ can be easier than making the decision based on audio cues, which can sometimes be confusing. On the other hand, deciding between /ka/ and /ga/ is more reliably done from the audio than from the video. Therefore, our confidence in the audio-based and video-based hypotheses depends on the kinds of sounds being confused. We will invent and test algorithms for combining the automatic speech classification decisions based on the audio and visual stimuli, with the goal of audio-visual speech recognition that significantly improves on traditional audio-only speech recognition performance.
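
To make the idea of combining the two decisions concrete, here is a minimal sketch of one common approach, a weighted log-linear combination of per-class scores from the two streams. It is an illustration only and not necessarily the method the workshop will adopt. The function names (`fuse_log_scores`, `decide`), the parameter `audio_weight`, and the probabilities in the example are all assumptions made up for this sketch.

```python
import math


def fuse_log_scores(audio_probs, video_probs, audio_weight):
    """Combine per-class probabilities from two classifiers (hypothetical helper).

    Each class score is audio_weight * log P_audio + (1 - audio_weight) * log P_video,
    renormalized so the fused scores form a probability distribution.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    # Only classes scored by both streams can be fused.
    classes = audio_probs.keys() & video_probs.keys()
    fused = {
        c: audio_weight * math.log(audio_probs[c])
        + (1.0 - audio_weight) * math.log(video_probs[c])
        for c in classes
    }
    # Normalize in the log domain, then return ordinary probabilities.
    log_norm = math.log(sum(math.exp(v) for v in fused.values()))
    return {c: math.exp(v - log_norm) for c, v in fused.items()}


def decide(probs):
    """Return the class with the highest fused probability."""
    return max(probs, key=probs.get)


# Illustrative numbers only: the audio is nearly undecided between /ba/ and /ga/,
# while the video (visible lip closure) strongly favors /ba/.
audio = {"ba": 0.45, "ga": 0.55}
video = {"ba": 0.90, "ga": 0.10}

audio_only = decide(fuse_log_scores(audio, video, audio_weight=1.0))
equal_weights = decide(fuse_log_scores(audio, video, audio_weight=0.5))
```

With these made-up numbers, the audio-only decision is /ga/, while equal weighting lets the video evidence tip the decision to /ba/. In practice the weight would have to depend on the kinds of sounds being confused and on the noise level, which is exactly the open question described above.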

Final Presentation Video

Team Members
Senior Members
Andreas Andreou	CLSP
Juergen Luettin	IDIAP
Iain Matthews	HCII, CMU
Chalapathy Neti	IBM
Gerasimos Potamianos	IBM
Graduate Students
Herve Glotin	ICP-Grenoble, France
Undergraduate Students
Azad Mashari	University of Toronto
June Sison	UC Santa Cruz
