Zhuo Chen (Microsoft)
Niko Brümmer (Omilia)
Marc Delcroix (NTT)
Jun Du (USTC)
Hakan Erdogan (Google)
Keisuke Kinoshita (NTT)
Johan Rohdin (BUT)
Christoph Boeddeker (Paderborn University)
Tobias Cord-Landwehr (Paderborn University)
Pavel Denisov (University of Stuttgart)
Chengda Li (SJTU)
Jiachen Lian (CMU)
Yi Luo (Columbia)
Thilo von Neumann (Paderborn University)
Roshan Sharma (CMU)
Anya Silnova (BUT)
Wangyou Zhang (SJTU)
Katerina Zmolikova (BUT)
Lukáš Burget (BUT)
Najim Dehak (JHU CLSP)
Dimitrios Dimitriadis (Microsoft)
John Hershey (Google)
Zili Huang (JHU)
Jinyu Li (Microsoft)
Zhong Meng (Microsoft)
Nima Mesgarani (Columbia)
Tomohiro Nakatani (NTT)
Yanmin Qian (SJTU)
Leibny Garcia Perera (JHU)
Dani Romero (JHU HLTCOE)
Themos Stafylakis (Omilia)
Reinhold Haeb-Umbach (Paderborn University)
Xiaofei Wang (Microsoft)
Takuya Yoshioka (Microsoft)
Tianyan Zhou (Microsoft)
Multi-talker conversational speech transcription using distant microphones is increasingly becoming an important application scenario in the speech industry. However, many fundamental challenges still need to be overcome. Overlapped speech (and, equally importantly, quick turn-taking), which breaks the assumption of "one active person at a time", is one of the long-standing problems that have barely been addressed. Speech separation and speaker extraction are extensively studied approaches for handling overlaps. The former separates each constituent speech signal, which is then processed with speech recognition and speaker diarization. The latter starts by detecting speaker segments in the overlapped speech to obtain speaker embeddings, followed by speaker-informed speech separation or recognition to extract the transcription for each speaker. While extensively studied in laboratory settings using pre-segmented utterances, the applications of these approaches to real unsegmented multi-talker recordings remain limited [1, 2]. Moreover, existing real-world applications are based on modular approaches using separately trained subsystems for speech separation, speech recognition, and so on, which may result in sub-optimal solutions.
In this workshop, we propose hosting a team to pursue the following goals: 1. to build fully contained multi-talker audio transcription systems based on the two approaches mentioned earlier whilst investigating their relative merits with respect to overlap handling, speech recognition, and speaker diarization, 2. to explore end-to-end modeling for dealing with unsegmented multi-talker audio recordings within the framework of each approach, and 3. to explore the use of unlabeled data to further improve the aforementioned techniques in an unsupervised manner. The emphasis of the project is placed on building fully contained systems that deal with unsegmented conversational audio with no dependency on unrealistic assumptions such as the availability of speaker segmentation files.
We aim to organize a research team of outstanding researchers who have worked intensively on the relevant areas, including speech separation, speech recognition, speaker diarization, unsupervised training, and end-to-end modeling. The team will put emphasis on successfully delivering the systems, which will provide foundations for future research, as well as on cross-fertilizing ideas between the related areas by investigating interdisciplinary end-to-end approaches.
1. Speech recognition — Most ASR studies are aimed at recognizing a segmented utterance of a single speaker. Permutation invariant training based ASR and speaker-aware ASR have been proposed to tackle the multi-talker ASR problem, but they are still evaluated on segmented speech. End-to-end ASR that leverages context obtained from multiple utterances is emerging, while still restricted to single-speaker recordings.
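The core difficulty of multi-talker ASR is that the model's outputs have no fixed order relative to the reference speakers. A minimal sketch of permutation invariant training (hypothetical NumPy code with a toy MSE loss; a real system would compute per-speaker ASR losses inside a neural network toolkit):

```python
from itertools import permutations

import numpy as np


def pit_loss(estimates, references):
    """Return the minimum average loss over all speaker permutations.

    estimates, references: arrays of shape (num_speakers, num_frames).
    The loss is computed for every assignment of outputs to references,
    and the cheapest assignment is used for training.
    """
    num_speakers = len(references)
    best, best_perm = np.inf, None
    for perm in permutations(range(num_speakers)):
        # Mean squared error under this assignment (toy stand-in for an
        # actual per-speaker ASR loss).
        loss = np.mean([
            np.mean((estimates[i] - references[p]) ** 2)
            for i, p in enumerate(perm)
        ])
        if loss < best:
            best, best_perm = loss, perm
    return best, best_perm


# Toy example: the model emits the two speakers in swapped order.
refs = np.array([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
ests = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
loss, perm = pit_loss(ests, refs)
print(loss, perm)  # 0.0 (1, 0)
```

The exhaustive search over permutations is factorial in the number of speakers, which is acceptable for the two-to-three speaker settings considered here.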
2. Speech separation — Speech separation technology has improved significantly in the past five years. For the single-channel case, starting with deep clustering, consistent progress has been made, resulting in end-to-end waveform-based speech separation. Multi-channel approaches have also been explored.
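The deep clustering objective mentioned above trains embeddings whose affinity matrix matches that of the ideal speaker assignments. A toy NumPy illustration of the commonly used form of the loss, ||VVᵀ − YYᵀ||²_F (a sketch, not the cited implementation):

```python
import numpy as np


def deep_clustering_loss(V, Y):
    """Deep clustering loss.

    V: (num_tf_bins, embed_dim) embeddings of time-frequency bins.
    Y: (num_tf_bins, num_speakers) one-hot labels of the dominant
       speaker in each bin.
    Returns the squared Frobenius distance between the two affinity
    matrices; it is zero when bins of the same speaker share identical
    embeddings and different speakers are orthogonal.
    """
    return np.linalg.norm(V @ V.T - Y @ Y.T, ord="fro") ** 2


# Two bins dominated by speaker 0, two by speaker 1.
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
V_perfect = Y.copy()                   # embeddings aligned with labels
V_bad = np.ones((4, 2)) / np.sqrt(2)   # all bins collapse to one point

print(deep_clustering_loss(V_perfect, Y))  # 0.0
print(deep_clustering_loss(V_bad, Y))      # 8.0 (all cross-speaker pairs wrong)
```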
3. Speaker segmentation & extraction — An alternative approach to the overlap problem is speaker extraction, which attempts to separate or recognize only the utterances of a target speaker using an audio snippet of that speaker [1, 9].
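As a crude illustration of speaker-informed processing (hypothetical code, not the extraction networks of [1, 9], which condition a separation or recognition network on the enrolment embedding), frames can be scored against an embedding derived from the target speaker's snippet:

```python
import numpy as np


def select_target_frames(frame_embs, enroll_emb, threshold=0.5):
    """Return a boolean mask of frames whose cosine similarity to the
    enrolment (target-speaker) embedding exceeds the threshold.

    A toy stand-in for speaker extraction: real systems feed the
    enrolment embedding into a neural network rather than thresholding
    similarities directly.
    """
    sims = frame_embs @ enroll_emb / (
        np.linalg.norm(frame_embs, axis=1) * np.linalg.norm(enroll_emb)
    )
    return sims > threshold


# Frames 0 and 2 resemble the target speaker; frame 1 does not.
frames = np.array([[1.0, 0.1], [0.0, 1.0], [0.9, 0.0]])
target = np.array([1.0, 0.0])
mask = select_target_frames(frames, target)
print(mask)  # [ True False  True]
```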
4. Probabilistic speaker embeddings — Speaker diarization can be done by segmenting the speech into short segments, extracting an embedding from each segment, and clustering the embeddings. This can optionally be followed by a resegmentation step. In this workshop, we aim to concentrate on the embedding extraction and clustering parts of the diarization problem. Since embeddings extracted from low-quality (noisy or overlapped) speech segments can spoil the clustering, we propose to augment each embedding with a quality factor that is extracted from each speech segment in parallel with the embedding. Formally, the quality factor will be treated as a precision scaling factor that reflects the uncertainty about what the value of the embedding might have been if it had been extracted from a high-quality segment. We already know how to utilize these quality factors, in a computationally efficient combination with PLDA, to compute likelihoods for clustering hypotheses. The research problem that remains for this workshop is to design and compare criteria for training the DNN that extracts the proposed probabilistic (quality-factor-augmented) embeddings.
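Under a Gaussian interpretation of such precision scaling factors (a sketch of one possible formulation, not the PLDA machinery referenced above), embeddings hypothesized to belong to the same cluster combine by precision weighting, so unreliable segments are automatically down-weighted:

```python
import numpy as np


def pool_probabilistic_embeddings(embs, precisions):
    """Precision-weighted pooling of segment embeddings.

    embs: (num_segments, dim) segment embeddings.
    precisions: per-segment quality factors acting as precision scalings;
    low-quality (noisy or overlapped) segments get small values and thus
    contribute little to the cluster estimate.
    Returns the pooled mean and the total precision of the pooled estimate.
    """
    b = np.asarray(precisions, dtype=float)[:, None]
    pooled = (b * embs).sum(axis=0) / b.sum()
    return pooled, float(b.sum())


# A clean segment (precision 3.0) dominates an overlapped one (precision 1.0).
embs = np.array([[1.0, 1.0], [3.0, 3.0]])
pooled, total_prec = pool_probabilistic_embeddings(embs, [1.0, 3.0])
print(pooled, total_prec)  # [2.5 2.5] 4.0
```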
5. Speaker diarization — Neural network-based end-to-end approaches are emerging in speaker diarization [13, 14]. Also, a few recent speaker diarization systems are designed to handle utterance overlaps [15, 16].
6. Unsupervised improvements — We will investigate ways to leverage untranscribed data.
This list of suggested research directions is certainly not exhaustive. The goal of the project is to address the hard problem of building robust speech recognition systems when no human annotation is available for matched training and development data. Neural networks, more accurate speech activity detection, multi-condition training, speech enhancement, unsupervised adaptation, etc. can all contribute to improving overall speech recognition performance.
These methods will pave the way for joint optimization of speaker diarization and other speech processing components.
Software Platform & Task Design
1. Dataset — We will provide two newly recorded datasets to facilitate research and allow for in-depth analysis of algorithms for handling utterance overlaps, one using a microphone array and one using spatially distributed microphones, called LibriCSS and LibriCSS-adhoc, respectively. The original unsegmented multi-talker recordings will be used.
2. Speech recognition — Both hybrid and end-to-end speech recognition systems will be considered. Open source systems such as PyKaldi2 and ESPnet will be utilized.
3. Speaker embedding — A state-of-the-art speaker embedding extraction system trained on the VoxCeleb dataset will be provided to bootstrap the diarization and speaker-informed speech separation work.
4. Probabilistic Embeddings — The work on probabilistic embeddings, clustering and diarization will proceed from baselines provided by various existing proprietary and open-source code bases for speaker embeddings, PLDA scoring, clustering and diarization. In addition to the above-mentioned new meeting databases, many existing databases for speaker recognition are also available for training relevant parts of the system.
In addition, an evaluation pipeline for multi-talker continuous speech recognition will also be provided to allow participants to immediately start research investigations. Open-source implementations will also be exploited wherever appropriate.
Task design — The target of this team is to address speech recognition and speaker diarization for unsegmented multi-talker speech recordings. We will evaluate novel architectures and techniques for this task, including the two preliminary setups as well as novel end-to-end solutions. Unsupervised adaptation of models trained on pre-segmented data is also a possible way to improve over the baseline. Different recording setups will also be considered, such as microphone arrays and spatially distributed microphones. WER and speaker-attributed WER will be the primary performance metrics.
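For reference, plain WER reduces to a word-level edit distance normalized by the reference length; a minimal NumPy sketch follows (speaker-attributed WER additionally requires an optimal assignment of hypothesis speakers to reference speakers, which is omitted here):

```python
import numpy as np


def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over words, divided by the
    number of reference words."""
    r, h = reference.split(), hypothesis.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)   # all deletions
    d[0, :] = np.arange(len(h) + 1)   # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / len(r)


print(wer("the cat sat", "the hat sat"))  # one substitution in three words
```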