Speaker and Language Recognition

In the summer of 2013, the Center for Language and Speech Processing (CLSP) hosted a 4-week workshop to explore new challenges in speaker and language recognition. A group of 16 international researchers came together to collaborate on the research areas described below. The workshop was motivated by the successful outcomes of the 2008 CLSP summer workshop and the BOSARIS workshops of 2010 and 2012.

The workshop was sponsored by the participants' individual funds and by Google Research.

Research areas

Domain adaptation for speaker recognition
+ Motivation: Advances in subspace modeling, specifically the i-vector approach, have demonstrated dramatic and consistent improvements in speaker recognition performance on the NIST speaker recognition evaluations (SREs) over the past 4 years. However, these techniques are highly dependent on having access to large amounts of labeled training data, from thousands of speakers each making tens of calls, to train the hyper-parameters (UBM, total-variability matrix, within- and between-speaker covariance matrices). The archive of past LDC data collections has provided such a set of data for the NIST SREs and has been used effectively. However, it is highly unrealistic to expect such a large set of labeled data from matched conditions when applying a speaker recognition system to a new application. Thus there is a need to focus research efforts on how to use unlabeled data for adapting and applying i-vector speaker recognition systems.

+ Resources:

– Domain Adaptation Challenge (DAC) description [pdf].

– If you want to start from the audio (assuming you have the data), here are the lists of files for the DAC.

– If you want to start from i-vectors, they are available here [link].

– Matlab script that shows how to use the i-vectors [example_DAC_cosine.m].

– Running a Gaussian PLDA system (like this) on the i-vectors above produces the following results.

+ List of publications:

– Stephen Shum, Douglas Reynolds, Daniel Garcia-Romero, and Alan McCree, “UNSUPERVISED CLUSTERING APPROACHES FOR DOMAIN ADAPTATION IN SPEAKER RECOGNITION SYSTEMS”, Odyssey, 2014.

– Daniel Garcia-Romero and Alan McCree, “SUPERVISED DOMAIN ADAPTATION FOR I-VECTOR BASED SPEAKER RECOGNITION”, ICASSP, 2014.

– Daniel Garcia-Romero, Alan McCree, Stephen Shum, Niko Brummer, and Carlos Vaquero, “UNSUPERVISED DOMAIN ADAPTATION FOR I-VECTOR SPEAKER RECOGNITION”, Odyssey, 2014.

– Hagai Aronowitz, “INTER DATASET VARIABILITY COMPENSATION FOR SPEAKER RECOGNITION”, ICASSP, 2014.

– Hagai Aronowitz, “COMPENSATING INTER-DATASET VARIABILITY IN PLDA HYPER-PARAMETERS FOR ROBUST SPEAKER RECOGNITION”, Odyssey, 2014.

– Jesus Villalba and Eduardo Lleida, “UNSUPERVISED ADAPTATION OF PLDA BY USING VARIATIONAL BAYES METHODS”, ICASSP, 2014.
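
For readers starting from the i-vectors listed under Resources above, the following is a minimal Python sketch of cosine scoring, the kind of baseline the name example_DAC_cosine.m suggests. It is not a port of that Matlab script, whose contents are not reproduced here. The function names, the dictionary layout, and the 3-dimensional toy vectors are all assumptions made for illustration; real i-vectors typically have several hundred dimensions.

```python
import math


def length_normalize(ivector):
    """Scale an i-vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in ivector))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero i-vector")
    return [x / norm for x in ivector]


def cosine_score(enroll_ivector, test_ivector):
    """Cosine similarity between two i-vectors of equal dimension."""
    if len(enroll_ivector) != len(test_ivector):
        raise ValueError("i-vectors must have the same dimension")
    a = length_normalize(enroll_ivector)
    b = length_normalize(test_ivector)
    return sum(x * y for x, y in zip(a, b))


def score_trials(models, tests, trials):
    """Score (model_id, test_id) pairs; a higher score favors 'same speaker'."""
    return [cosine_score(models[m], tests[t]) for m, t in trials]


# Toy 3-dimensional vectors standing in for real i-vectors (hypothetical data).
models = {"spk_a": [1.0, 2.0, 0.5]}
tests = {"utt_1": [2.0, 4.0, 1.0], "utt_2": [-1.0, 0.5, 0.0]}
scores = score_trials(models, tests, [("spk_a", "utt_1"), ("spk_a", "utt_2")])
```

In this toy data utt_1 points in the same direction as spk_a and utt_2 is orthogonal to it, so the two scores sit at the top and the middle of the cosine range. Cosine scoring uses no labeled in-domain data, which is why it is a convenient reference point before moving to a PLDA system whose hyper-parameters must be trained or adapted.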

Unsupervised score calibration
+ Motivation: When a speaker recognizer is deployed in a new environment, which may differ from previously seen environments with respect to factors like language, demographics, vocal effort, noise level, microphone, transmission channel, duration, etc., the behaviour of the scores may change. Although the scores can still be expected to discriminate between targets and non-targets in the new environment, score distributions could change between environments. If scores are to be used to make hard decisions, then we need to calibrate the scores for the appropriate environment. To date, most work on calibration has made use of supervised data. Here, we explore the problem of calibration where our only resource is a large database of completely unlabeled scores.

+ List of publications:

– Niko Brummer and Daniel Garcia-Romero, “GENERATIVE MODELLING FOR UNSUPERVISED SCORE CALIBRATION”, ICASSP, 2014.
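
To make the calibration problem above concrete, here is a hedged Python sketch of one simple generative approach: fit a two-component Gaussian mixture to unlabeled scores with EM, treat the higher-mean component as the target distribution, and map each raw score to a log-likelihood ratio (LLR) that can be compared against a Bayes decision threshold. The two-Gaussian model, the min/max initialization, and all function names are assumptions chosen for illustration; the model used in the publication listed above may differ.

```python
import math
import random


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def fit_two_gaussians(scores, n_iter=50):
    """EM for a 2-component 1-D Gaussian mixture over unlabeled scores."""
    n = len(scores)
    overall_mean = sum(scores) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in scores) / n, 1e-6)
    # Assumed initialization: targets near the highest score, non-targets near the lowest.
    m_t, m_n = max(scores), min(scores)
    v_t = v_n = overall_var
    w_t = 0.5
    for _ in range(n_iter):
        # E-step: posterior probability that each score is a target score.
        resp = []
        for x in scores:
            lt = math.log(w_t) + log_gauss(x, m_t, v_t)
            ln = math.log(1.0 - w_t) + log_gauss(x, m_n, v_n)
            mx = max(lt, ln)
            pt, pn = math.exp(lt - mx), math.exp(ln - mx)
            resp.append(pt / (pt + pn))
        # M-step: re-estimate weight, means and variances (with small floors).
        n_t = max(sum(resp), 1e-10)
        n_n = max(n - n_t, 1e-10)
        m_t = sum(r * x for r, x in zip(resp, scores)) / n_t
        m_n = sum((1.0 - r) * x for r, x in zip(resp, scores)) / n_n
        v_t = max(sum(r * (x - m_t) ** 2 for r, x in zip(resp, scores)) / n_t, 1e-6)
        v_n = max(sum((1.0 - r) * (x - m_n) ** 2 for r, x in zip(resp, scores)) / n_n, 1e-6)
        w_t = min(max(n_t / n, 1e-6), 1.0 - 1e-6)
    if m_t < m_n:  # keep the convention that targets score higher
        m_t, m_n, v_t, v_n, w_t = m_n, m_t, v_n, v_t, 1.0 - w_t
    return {"w_tar": w_t, "mean_tar": m_t, "var_tar": v_t,
            "mean_non": m_n, "var_non": v_n}


def calibrate(score, params):
    """Map a raw score to a log-likelihood ratio under the fitted mixture."""
    return (log_gauss(score, params["mean_tar"], params["var_tar"])
            - log_gauss(score, params["mean_non"], params["var_non"]))


def bayes_threshold(p_target, cost_miss=1.0, cost_fa=1.0):
    """LLR threshold that minimizes expected cost; accept if llr > threshold."""
    return math.log(cost_fa * (1.0 - p_target) / (cost_miss * p_target))


# Synthetic unlabeled scores (hypothetical): 900 non-target and 100 target trials.
rng = random.Random(0)
unlabeled = ([rng.gauss(-2.0, 1.0) for _ in range(900)]
             + [rng.gauss(4.0, 1.0) for _ in range(100)])
params = fit_two_gaussians(unlabeled)
accept = calibrate(3.5, params) > bayes_threshold(p_target=0.5)
```

The labels never enter the fit; the mixture recovers the two score populations only because they are reasonably well separated, which matches the assumption in the text that the scores still discriminate in the new environment. When targets and non-targets overlap heavily, or the target proportion is very small, a plain two-Gaussian fit like this one can be unreliable.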

Deep neural networks for language recognition
+ Motivation: Deep Neural Networks (DNNs) have recently proved successful in challenging machine learning applications such as acoustic modelling, visual object recognition and many others, especially when large amounts of training data are available. Motivated by those results and by the discriminative nature of DNNs, which could complement the generative i-vector approach, we adapt DNNs to work at the acoustic frame level to perform Language Identification. In particular, we build and experiment with several DNN configurations and compare the results with several state-of-the-art i-vector based systems trained on the same acoustic features.

+ List of publications:

– Ignacio Lopez-Moreno, Javier Gonzalez-Dominguez, Oldrich Plchot, David Martinez, Joaquin Gonzalez-Rodriguez, and Pedro Moreno, “AUTOMATIC LANGUAGE IDENTIFICATION USING DEEP NEURAL NETWORKS”, ICASSP, 2014.
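
Because the DNN described above operates on individual acoustic frames, its per-frame outputs have to be combined into a single decision per utterance. The Python sketch below shows one common way to do that, averaging per-frame log-posteriors and picking the highest-scoring language. The combination rule, the function names, and the toy output values are assumptions made for illustration, not details taken from the workshop systems.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's output activations."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]


def utterance_scores(frame_logits):
    """Average the per-frame log-posteriors over all frames of an utterance."""
    n_lang = len(frame_logits[0])
    totals = [0.0] * n_lang
    for logits in frame_logits:
        for k, lp in enumerate(log_softmax(logits)):
            totals[k] += lp
    return [t / len(frame_logits) for t in totals]


def identify_language(frame_logits, languages):
    """Return the language whose averaged log-posterior is highest."""
    scores = utterance_scores(frame_logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return languages[best]


# Hypothetical network outputs: 3 frames, 3 candidate languages.
languages = ["en", "es", "cs"]
frames = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2], [0.3, 0.4, 0.1]]
decision = identify_language(frames, languages)
```

Averaging in the log domain lets a few confident frames outweigh many uncertain ones, and it keeps the utterance score independent of the number of frames, so short and long utterances remain comparable.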

JFA-based front ends for speaker recognition
+ Motivation: The goal is to overcome some of the limitations of the i-vector representation of speech segments by exploiting Joint Factor Analysis (JFA) as an alternative feature extractor. The work addresses both text-independent and text-dependent speaker recognition.

+ List of publications:

– Patrick Kenny, Themos Stafylakis, Jahangir Alam, and Pierre Ouellet, “JFA-BASED FRONT ENDS FOR SPEAKER RECOGNITION”, ICASSP, 2014.

Vector Taylor Series (VTS) for i-vector extraction
+ Motivation: i-vector speaker recognition systems achieve good performance in clean environments. The goal is to adapt the i-vector approach to noisy conditions, where the accuracy of the systems degrades. Our solutions are based on VTS and the unscented transform (UT). We have adopted the simplified VTS recently proposed by Yun Lei et al. (2013), and studied a new approach based on UT that allows a more accurate modelling of nonlinearities. The latter is especially useful for very low SNRs.

+ List of publications:

– David Martinez, Lukas Burget, Themos Stafylakis, Yun Lei, Patrick Kenny, and Eduardo Lleida, “UNSCENTED TRANSFORM FOR IVECTOR-BASED NOISY SPEAKER RECOGNITION”, ICASSP, 2014.

Team Members
Affiliate Members
Hagai Aronowitz IBM, Israel
Niko Brummer AGNITIO, South Africa
Lukas Burget Brno University of Technology, Czech Republic
Sandro Cumani Brno University of Technology, Czech Republic
Najim Dehak MIT, USA
Daniel Garcia-Romero Johns Hopkins University, USA
Javier Gonzalez Dominguez Universidad Autonoma de Madrid, Spain
Patrick Kenny CRIM, Canada
Ignacio Lopez Moreno Google, USA
David Martinez Universidad de Zaragoza, Spain
Oldrich Plchot Brno University of Technology, Czech Republic
Themos Stafylakis CRIM, Canada
Albert Swart AGNITIO, South Africa
Carlos Vaquero AGNITIO, Spain
Karel Vesely Brno University of Technology, Czech Republic
Jesus Villalba Universidad de Zaragoza, Spain

Center for Language and Speech Processing