Identifying individuals based on their speech is an important component
technology in many applications, be it automatically tagging speakers in
the transcription of a board-room meeting (to track who said what), verifying
users for computer security, or picking out a known terrorist or
narcotics trader among millions of ongoing satellite telephone calls.
How do we recognize the voices of the people we know? Generally, we use
multiple levels of speaker information conveyed in the speech signal. At
the lowest level, we recognize a person based on the sound of his/her
voice (e.g., low or high pitch, bass, nasality). But we also use other
types of information in the speech signal to recognize a speaker, such as
a unique laugh, particular phrase usage, or speaking rate.
Most current state-of-the-art automatic speaker recognition systems,
however, use only the low-level sound information (specifically, very
short-term features based on purely acoustic signals computed on 10-20 ms
intervals of speech) and ignore higher-level information. While these
systems have shown reasonably good performance, speech contains much more
information that could be exploited, potentially yielding large gains in
accuracy and robustness.
In this workshop we will look at how to augment the traditional
signal-processing-based speaker recognition systems with such higher-level
knowledge sources. We will explore ways to define
speaker-distinctive markers and create new classifiers that make use of
these multi-layered knowledge sources. The team will work with
recorded telephone conversations (the Switchboard I and II corpora)
that have been transcribed both by humans and by machine and have been
augmented with a rich database of phonetic and prosodic features. A
well-defined performance evaluation procedure will be used to measure
progress and the utility of newly developed techniques.