Acoustic-optical Phonetics and Audiovisual Speech Perception

April 17, 2001 (all day)

Several sources of behavioral evidence show that speech perception is audiovisual when both acoustic and optical speech signals are available to the perceiver. The McGurk effect and the enhancement of auditory speech intelligibility in noise are two well-known examples. In comparison with acoustic phonetics and auditory speech perception, however, relatively little is known about optical phonetics and visual speech perception. Likewise, how optical and acoustic signals are related, and how they are integrated perceptually, remain open questions. We have been studying relationships between kinematic and acoustic recordings of speech. The kinematic recordings were made with an optical recording system that tracked movements on talkers' faces and with a magnetometer system that simultaneously tracked tongue and jaw movements. Speech samples included nonsense syllables and sentences from four talkers, prescreened for visual intelligibility. Mappings among the kinematic and acoustic signals show a perhaps surprisingly high degree of correlation.

However, correlations among speech signals are not themselves evidence about the perceptual mechanisms responsible for audiovisual integration. Perceptual evidence from McGurk experiments has been used to hypothesize early phonetic integration of visual and auditory speech information, even though some of these experiments have also shown that the effect survives relatively long crossmodal temporal asynchronies. The McGurk effect can be elicited when an acoustic /ba/ is combined in synchrony with a visual /ga/, typically resulting in the perceiver reporting having heard /da/. To investigate the time course and cortical location of audiovisual integration, we obtained event-related potentials (ERPs) from twelve adults, prescreened for McGurk susceptibility.
Stimuli were presented in an oddball paradigm to evoke the mismatch negativity (MMN), a neurophysiological discrimination measure most robustly demonstrated with acoustic contrasts. The conditions were audiovisual McGurk stimuli, visual-only stimuli from the McGurk condition, and auditory stimuli corresponding to the McGurk-condition percepts (/ba/-/da/). The magnitude (area) of the MMN in the audiovisual condition peaked at a latency greater than 300 ms, much later than the peak of the auditory MMN (approximately 260 ms), suggesting that integration occurs later than auditory phonetic processing. Additional latency, amplitude, and dipole source analyses revealed similarities and differences among the auditory, visual, and audiovisual conditions. The results support an audiovisual integration neural network that is at least partly distinct from the unimodal networks and that operates at a longer latency. In addition, the results showed dynamic differences in processing between correlated and uncorrelated audiovisual combinations.

These results point to a biocomplex system. In our case, the agents of complexity theory can be taken to be (non-exclusively) the unimodal sensory/perceptual systems, which have important heterogeneous characteristics. Auditory and visual perception each have their own organization and, when combined, apparently participate in yet another organization. Apparently, too, the dynamics of audiovisual organization vary depending on the correlation between the acoustic and optical phonetic signals. This view contrasts with accounts of audiovisual integration based primarily on algorithms or formats for information combination.

Center for Language and Speech Processing