Robust Representation of Attended Speech in Human Brain with Implications for ASR – Nima Mesgarani (University of California, San Francisco)
Humans possess a remarkable ability to attend to a single speaker’s voice in a multi-talker background. How the auditory system manages to extract intelligible speech under such acoustically complex and adverse listening conditions is not known, and indeed, it is not clear how attended speech is internally represented. Here, using multi-electrode recordings from the cortex of epileptic patients engaged in a listening task with two simultaneous speakers, we demonstrate that population responses in the temporal lobe faithfully encode critical features of attended speech: speech spectrograms reconstructed from cortical responses to the mixture of speakers reveal salient spectral and temporal features of the attended speaker, as if the subject were listening to that speaker alone. As a result, a simple classifier trained solely on examples of single speakers can decode both attended words and speaker identity. We find that task performance is well predicted by a rapid increase in attention-modulated neural selectivity across both local single-electrode and population-level cortical responses. These findings demonstrate that the temporal lobe cortical representation of speech does not merely reflect the external acoustic environment, but instead correlates with the perceptual aspects relevant to the listener’s intended goal. An engineering approach to ASR inspired by a model of this process is shown to improve recognition accuracy in new noisy conditions.
Nima Mesgarani is a postdoctoral scholar in the Department of Neurological Surgery at the University of California, San Francisco. He received his Ph.D. in electrical engineering from the University of Maryland, College Park. He was a postdoctoral fellow at the Center for Language and Speech Processing at Johns Hopkins University prior to joining UCSF. His research interests include the representation of speech in the brain and its implications for speech processing technologies.