SuperSID: Exploiting High-level Information for High-performance Speaker Recognition

Identifying individuals based on their speech is an important component technology in many applications, whether automatically tagging speakers in the transcript of a board-room meeting (to track who said what), verifying a user's identity for computer security, or picking out a known terrorist or narcotics trafficker among millions of ongoing satellite telephone calls.
How do we recognize the voices of the people we know? Generally, we use multiple levels of speaker information conveyed in the speech signal. At the lowest level, we recognize a person based on the sound of his/her voice (e.g., low/high pitch, bass, nasality, etc.). But we also use other types of information in the speech signal to recognize a speaker, such as a unique laugh, particular phrase usage, or speed of speech among other things.

Most current state-of-the-art automatic speaker recognition systems, however, use only the low-level sound information (specifically, very short-term features computed from the acoustic signal over 10-20 ms intervals of speech) and ignore higher-level information. While these systems have shown reasonably good performance, the speech signal carries much more information that could be exploited to improve accuracy and robustness.
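To make the "very short-term" time scale concrete, the sketch below slices a waveform into the overlapping 10-20 ms frames on which conventional acoustic front ends compute their features. This is an illustrative example, not code from the workshop; the function name and parameters are assumptions.

```python
# Hypothetical sketch: slicing a waveform into the short-term frames
# (10-20 ms) used by conventional acoustic speaker-recognition front ends.
# Names and defaults are illustrative, not taken from any workshop system.

def frame_signal(samples, sample_rate=8000, frame_ms=20, hop_ms=10):
    """Split a sample sequence into overlapping short-term frames."""
    frame_len = sample_rate * frame_ms // 1000   # samples per frame (160 at 8 kHz)
    hop_len = sample_rate * hop_ms // 1000       # samples between frame starts (80)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

# One second of (dummy) 8 kHz telephone-band audio -> 99 overlapping 20 ms frames.
frames = frame_signal([0.0] * 8000)
print(len(frames), len(frames[0]))  # 99 160
```

Each such frame would then be converted to a spectral feature vector (e.g., cepstral coefficients), so a one-second utterance already yields roughly a hundred low-level observations.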

In this workshop we will look at how to augment traditional signal-processing-based speaker recognition systems with such higher-level knowledge sources. We will explore ways to define speaker-distinctive markers and create new classifiers that make use of these multi-layered knowledge sources. The team will work on a corpus of recorded telephone conversations (the Switchboard I and II corpora) that has been transcribed both by humans and by machine and augmented with a rich database of phonetic and prosodic features. A well-defined performance evaluation procedure will be used to measure the progress and utility of newly developed techniques.
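Speaker-detection performance is commonly summarized by the equal error rate (EER), the operating point where the miss rate and false-alarm rate coincide. The source does not specify the workshop's evaluation metric, so the following is only a hedged sketch of how an EER might be computed from detection scores; the function name and score lists are illustrative.

```python
# Hypothetical sketch of the equal error rate (EER), a common summary
# metric for speaker-detection systems. Scores below are made up.

def equal_error_rate(target_scores, impostor_scores):
    """Sweep thresholds; return the point where miss and false-alarm rates meet."""
    candidates = sorted(set(target_scores) | set(impostor_scores))
    best_gap, best_eer = float("inf"), 1.0
    for thr in candidates:
        miss = sum(s < thr for s in target_scores) / len(target_scores)
        fa = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(miss - fa)
        if gap < best_gap:
            best_gap, best_eer = gap, (miss + fa) / 2
    return best_eer

# Illustrative scores: higher means "more likely the claimed speaker".
eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
print(eer)  # 0.25
```

A lower EER from a combined low-level plus high-level system would be direct evidence that the added knowledge sources carry speaker-distinctive information.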


Team Members

Senior Members
Walter Andrews, DoD
Joe Campbell, MIT Lincoln Laboratory
Jiri Navratil, IBM
Barbara Peskin, ICSI
Doug Reynolds, MIT Lincoln Laboratory

Graduate Students
Andre Adami, OGI
Qin Jin, Carnegie Mellon University
David Klusacek, Charles University

Undergraduate Students
Joy Abramson, York University
Radu Mihaescu, Princeton University

Johns Hopkins University, Whiting School of Engineering

Center for Language and Speech Processing
Hackerman 226
3400 North Charles Street, Baltimore, MD 21218-2680