Speech Recognition by Humans and Machines

A lecture given by Richard P. Lippmann of MIT's Lincoln Laboratory
at the 1996 Summer Workshop on Innovative Techniques for Large Vocabulary Conversational Speech Recognition, CLSP/JHU, on
August 7, 1996 at 3pm in Arellano Theater, Levering Hall.


This talk reviews past research on human speech perception and recent studies that compare the performance of humans and speech recognizers using six modern speech corpora with vocabularies ranging from 10 to 65,000 words. Error rates of machines are often more than an order of magnitude greater than those of humans for quiet, clearly spoken speech. Machine performance falls even further below that of humans in noise and under other stressing conditions. Human performance remains high under natural variability caused by new talkers, spontaneous speaking styles, noise, and reverberation. Human performance also remains high under unnatural degradations caused by waveform clipping, band-reject filtering, and analog waveform scrambling. Humans can also recognize quiet, clearly spoken nonsense syllables and words without high-level grammatical information. Much further algorithm development is required before even the low-level acoustic-phonetic accuracy of machines equals that of humans on real-world tasks. Obtaining such human performance levels under common speech degradations is an essential step towards expanding current commercially successful niche applications of speech recognition technology into a more widespread user community.