Ken Grant (Walter Reed National Military Medical Center) “Speech Perception with Minimal Acoustic Cues Informs Novel Approaches to Automatic Speech Recognition”

February 2, 2018 @ 12:00 pm – 1:15 pm
Hackerman Hall B17
Malone Hall
3400 N Charles St, Baltimore, MD 21218


When confronted with the daunting task of transmitting speech information to deaf individuals, one comes quickly to the conclusion that the solution to this problem requires a full-blown theory of speech perception. Because the bandwidth and dynamic range of speech far exceed the capacity of the deaf ear, radical recoding of important speech information and sensory substitution schemes have been proposed. Within this framework, at least four major questions must be addressed: 1) What are the essential elements of the signal that must be transmitted? 2) What is the information capacity of the receiving sensory system? 3) Does the information capacity of the receiving system match (or exceed) the demands of the signal(s) being transmitted, and if it doesn’t, how should the signal information be recoded to better match the receiving system’s capabilities? 4) What methods will be used to evaluate the success (or failure) of the enterprise? The advantage of dissecting the problem into these four crucial questions is that one can develop a systematic approach to understanding speech recognition that applies equally to sensory substitution such as tactile speech aids, advanced bionics such as cochlear implants, or hearing aids. In this talk, I will present several examples of bimodal and unimodal speech recognition where high levels of intelligibility are achieved with minimal auditory information or by incorporating visual speech information gleaned from lipreading (i.e., speechreading). In the bimodal examples, the amount of transmitted auditory speech information is insufficient to support word or sentence intelligibility (zero percent correct), and the average speechreading performance, even for the very best speechreader (who is usually a deaf individual), might be 10-30% word or sentence intelligibility.
Similar findings have been shown for auditory-only speech inputs composed of disjoint, non-overlapping spectral bands where over 90% of the spectral information has been discarded. The very fact that high levels of speech intelligibility (>80%) can be achieved with multimodal inputs, where the auditory and visual modalities individually fail to transmit enough information to support speech perception, and with unimodal inputs composed of combinations of spectral bands, where individual bands provide minimal acoustic information, may suggest novel approaches to automatic speech recognition.


Center for Language and Speech Processing