Low-dimensional speech representation based on Factor Analysis and its applications – Najim Dehak (MIT)
We introduce a novel approach to data-driven feature extraction stemming from the field of speaker recognition. In the last five years, statistical methods rooted in factor analysis have greatly enhanced the traditional representation of a speaker using Gaussian Mixture Models (GMMs). In this talk, we build some intuition by outlining the historical development of these methods and then survey the variety of applications made possible by this approach. To begin, we discuss the development of Joint Factor Analysis (JFA), which was motivated by a desire to model speaker variability and compensate for channel/session variability at the same time. In doing so, we introduce the notion of a GMM supervector, a high-dimensional vector created by concatenating the mean vectors of all the GMM components. JFA assumes that this supervector can be decomposed into a sum of two parts: one containing relevant speaker-specific information and another containing channel-dependent nuisance factors that need to be compensated for. We will describe the methods used to estimate these hidden parameters.
The success of JFA led to a proposed simplification that uses a single factor analysis model to extract speaker-relevant features. The key assumption here is that most of the variability between GMM supervectors can be explained by a (much) lower-dimensional space of underlying factors. In this approach, a given utterance of any length is mapped to a single vector in a low-dimensional “total variability” space. We call the resulting vector an i-vector, short for “identity vector” in the speaker recognition sense or “intermediate vector” for its intermediate size between that of a supervector and that of an acoustic feature vector. Unlike JFA, the total variability approach makes no distinction between speaker and inter-session variability in the high-dimensional supervector space; instead, channel compensation occurs in the lower-dimensional i-vector space.
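As a rough illustration of the two ideas above, the following sketch builds a GMM supervector by concatenation and then computes an i-vector as the posterior mean of the latent factors under the commonly used total variability model M = m + T w. The abstract does not give these formulas; the symbols T, w, N and F_centered, the function names, and the toy dimensions are all assumptions made for this example, and training of T (normally done with EM) is left out entirely.

```python
import numpy as np

def supervector(component_means):
    """Concatenate C mean vectors of dimension F into one (C*F,) supervector."""
    return np.asarray(component_means, dtype=float).reshape(-1)

def extract_ivector(T, sigma, N, F_centered):
    """Posterior mean of w under the assumed model M = m + T w.

    T          : (C*F, R) total variability matrix (assumed already trained)
    sigma      : (C*F,)   diagonal UBM covariances, stacked like the supervector
    N          : (C,)     zero-order (occupancy) statistics per component
    F_centered : (C*F,)   first-order statistics centered on the UBM means
    """
    CF, R = T.shape
    feat_dim = CF // N.shape[0]
    N_expanded = np.repeat(N, feat_dim)      # occupancy per supervector dimension
    T_scaled = T / sigma[:, None]            # Sigma^{-1} T
    precision = np.eye(R) + T_scaled.T @ (N_expanded[:, None] * T)
    return np.linalg.solve(precision, T_scaled.T @ F_centered)

# Toy example with hypothetical sizes: 4 components, 3-dim features, rank-2 subspace.
rng = np.random.default_rng(0)
C, F, R = 4, 3, 2
m = supervector(rng.normal(size=(C, F)))     # UBM mean supervector, shape (12,)
T = rng.normal(size=(C * F, R))
sigma = np.ones(C * F)
w_true = rng.normal(size=R)

# Noise-free statistics for a long utterance whose supervector is m + T w_true.
N = np.full(C, 1e4)
F_centered = np.repeat(N, F) * (T @ w_true)

w = extract_ivector(T, sigma, N, F_centered)
print("estimated i-vector:", w)
print("true factors:      ", w_true)
```

With large occupancy counts the unit-variance prior on w has little influence, so the estimate lands close to the factors used to generate the statistics. For short utterances the prior pulls the estimate toward zero.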
The presentation will outline the process for building a robust speaker verification system. Though originally proposed for speaker modeling, the i-vector representation can be seen more generally as an elegant framework for data-driven feature extraction. After covering the necessary background theory, we will discuss our recent work in applying this approach to a variety of other audio classification problems, including speaker diarization and language identification.
Najim Dehak received his Engineering degree in Artificial Intelligence in 2003 from Universite des Sciences et de la Technologie d’Oran, Algeria, and his MS degree in Pattern Recognition and Artificial Intelligence Applications in 2004 from the Universite de Pierre et Marie Curie, Paris, France. He obtained his Ph.D. degree from Ecole de Technologie Superieure (ETS), Montreal, in 2009. During his Ph.D. studies he was also with Centre de recherche informatique de Montreal (CRIM), Canada. In the summer of 2008, he participated in the Summer Workshop at the Johns Hopkins University Center for Language and Speech Processing. During that time, he proposed a new system for speaker verification that uses factor analysis to extract speaker-specific features, thus paving the way for the development of the i-vector framework. Dr. Dehak is currently a research scientist in the Spoken Language Systems (SLS) Group at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). His research interests are in machine learning approaches applied to speech processing and speaker modeling. His current research focuses on extending the i-vector representation to other audio classification problems, such as speaker diarization, language identification, and emotion recognition.