Using Cooperative Ad-hoc Microphone Arrays for ASR

2019 Sixth Frederick Jelinek Memorial Summer Workshop

 

Figure: The traditional pipeline architecture (left) of a speech recognition system, based on pre-processing steps for speaker localization and speech enhancement; an example of a cooperative neural architecture (right), characterized by full communication among its three components.

Distant-speech recognition (DSR) remains a very challenging task due to the variability introduced by noisy, reverberant, highly non-stationary, and unpredictable environments [1]. Despite significant recent progress driven by deep learning, benchmarking activities, e.g., CHiME5 [2], and projects, e.g., DIRHA (see http://dirha.fbk.eu), have made evident that microphone arrays distributed in space are effective in capturing and decoding the acoustic scene, but many problems remain open, especially in the case of large speaker-array distances, multiple overlapping speakers, and spontaneous speech. Another issue concerns different sampling clock speeds, and possibly limited knowledge of microphone positions, which characterize ad-hoc microphone arrays distributed in space. For CHiME5, indeed, specific pre-processing and manual annotation were necessary to time-align speech segments, produced by different speakers, that were acquired by different Kinect devices.

We are currently investigating a GCC-PHAT based approach for synchronization across devices, which is crucial for an effective application of beamforming, enhancement, and other front-end processing techniques. The workshop aims to expand this work, with the main goal of developing fully automatic methods to remedy clock drifts, and of evaluating their impact on speaker activity alignment as well as on DSR performance.
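To make the GCC-PHAT idea concrete, the following is a minimal NumPy sketch of phase-transform delay estimation between two channels. It is an illustration only, not the workshop's actual implementation; the function name, signal setup, and delay value are ours:

```python
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` via GCC-PHAT.

    The cross-power spectrum is whitened (PHAT weighting), so the peak
    of the resulting cross-correlation depends only on phase, which
    makes the estimate more robust to reverberation.
    """
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-15                        # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-center the circular correlation around lag 0
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # delay in seconds

# Synthetic check: a white-noise signal delayed by 100 samples
fs = 16000
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)                       # 1 s of noise
y = np.concatenate((np.zeros(100), x))[:fs]       # same signal, 100 samples late
tau = gcc_phat(y, x, fs=fs)                       # expected: about 100 / 16000 s
```

In the cross-device setting, the same correlation would be computed repeatedly over time so that a slowly growing delay (rather than a fixed one) can be tracked and compensated.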

Besides a possible clock speed mismatch among input devices, other problems need to be addressed at the front-end processing level when using microphone arrays for DSR. Classical approaches apply multi-channel processing techniques that analyze the acoustic scene, perform tasks such as speech activity detection, speaker localization and identification, beamforming, speech separation and enhancement [4], and eventually feed the back-end speech recognition component for decoding. However, a limitation of such systems lies in the lack of matching and communication between the front-end and back-end technologies. End-to-end methods represent a very promising direction for this goal, though they require adequate computational resources and extremely large, representative corpora to obtain an effective jointly-trained framework. In this direction, we have recently developed a novel architecture based on a cooperative network of deep neural networks [5], in which all the components are jointly trained and cooperate better with each other thanks to a full communication scheme. Moreover, we recently introduced SincNet, a novel neural architecture able to process raw audio waveforms directly and efficiently using sinc-based convolutional filters. SincNet learns filters tuned to the task at hand, for instance, speaker classification or noisy speech recognition. During the workshop, our goal is to explore the use of SincNet and cooperative neural frameworks to jointly train front-end and back-end neural models, e.g., three networks devoted to the above-mentioned clock drift mitigation, to speaker classification, and to DSR acoustic modeling, respectively.
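The core idea behind SincNet's filters can be sketched in a few lines: each filter is a windowed difference of two ideal sinc low-pass responses, so it is fully determined by its two cutoff frequencies. The sketch below uses fixed cutoffs for illustration; in the actual architecture those two cutoffs per filter are the learned parameters, and the function name and normalization choice here are ours:

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_len=129, fs=16000):
    """Build one SincNet-style band-pass FIR kernel.

    Only the two cutoff frequencies are free parameters: the band-pass
    response is the difference of two windowed sinc low-pass filters,
    so each filter costs 2 values instead of kernel_len.
    """
    t = (np.arange(kernel_len) - kernel_len // 2) / fs  # time axis in seconds

    def lowpass(fc):
        # Ideal low-pass impulse response with cutoff fc (Hz)
        return 2 * fc * np.sinc(2 * fc * t)

    h = lowpass(f_high) - lowpass(f_low)      # band-pass = difference of low-passes
    h *= np.hamming(kernel_len)               # window to reduce spectral ripple
    return h / np.sum(np.abs(h))              # simple amplitude normalization

# A telephone-band filter (300-3400 Hz) as an example
kernel = sinc_bandpass(300.0, 3400.0)
```

In a learnable version, `f_low` and `f_high` would be network parameters updated by backpropagation, and the resulting kernels would be applied to the raw waveform as an ordinary 1-D convolution.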

Finally, a limitation of standard DSR systems is that they are trained in a fully-supervised manner. Methods to learn meaningful acoustic and speech representations in an unsupervised fashion could be very useful to make a deep learning system more robust, especially when addressing challenging scenarios. As shown in some recent works, e.g., [8], unsupervised learning-based systems can be used to learn high-level abstract representations. During the workshop, we plan to investigate a cooperative neural framework which will include one or more networks trained in a semi-supervised way to model the above-mentioned tasks.

 

Expected outcomes and impact

By the end of the workshop, we expect to have developed cooperative neural-based solutions to remedy clock drifts using ad-hoc microphone arrays for DSR, as well as alternative solutions based on standard state-of-the-art techniques. A performance comparison, using different training and test data sets, will show the pros and cons of each approach and lead to a better understanding of how this problem can best be addressed. Though we will primarily spend our efforts on offline experiments, we will also take into account the relevance of the synchronization problem under real-world conditions, i.e., for real-time applications: during the last part of the workshop, we plan to analyze the scalability of the most viable techniques towards online settings.

Besides this, we think that another significant impact of the workshop will concern the investigation of semi-supervised neural approaches, which represent progress towards solutions based on fully unsupervised learning.

Finally, an impact of the workshop regards the public distribution of data sets and of PyTorch-Kaldi recipes, which can be very useful to the scientific community, both for comparison purposes and for starting similar studies. As a side result of our activity, we also plan to produce an automatic annotation of time-aligned speaker activity, with corpora such as CHiME5, which could represent a relevant input for the next CHiME6 challenge.

Additional info for undergrads

A fundamental step to accomplish before the workshop will be the creation of an experimental framework consisting of clearly defined tasks, related audio corpora to use for system training and test, as well as evaluation criteria. The corpora will include both real and simulated multi-microphone signals and the related annotation files.

In this way, at the beginning of the workshop, one or more baseline systems will be available for each experimental task, with related source code, performance on one or more corpora, how-to-run instructions, and other relevant documentation. Some of these baselines will be presented during the summer school.

All of these preliminary steps will help speed up the first activities of the workshop.

 

Then, a major goal of the workshop will be to progress on each scientific topic, based on clear evidence of each technical advance, represented by an increase in performance over the baseline. To this purpose, the team will be divided into three groups that will tackle the given challenges with different approaches, strategies, and techniques. For instance, one group will work on solutions based on cooperative and semi-supervised neural frameworks, while another group will tackle the same challenges by adopting a more traditional approach based on non-cooperative neural frameworks.

The two above-mentioned groups will be more active on speech transcription tasks, while a third group will focus on the automatic description of a multi-microphone scene, for instance, establishing when a single speaker has been active or when two speakers have talked simultaneously, and, in general, aiming to produce an automatic annotation of the acoustic activities that characterize the scene.

A common aspect for all the groups will be that the data were captured by a set of microphone array devices distributed in space, each device operating with a different clock (i.e., ad-hoc microphone arrays), in other words, with a sampling rate that slightly differs from the nominal rate (e.g., 16000 Hz) in an unknown way. This is a variability we aim to eventually remedy in an unsupervised manner. The related technical problem represents one of the most complex challenges addressed by this workshop.
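To give a sense of the magnitudes involved, the following arithmetic sketch (with purely hypothetical clock figures) shows how even a small sampling-rate deviation accumulates into a misalignment that matters for beamforming:

```python
# A device whose clock runs at 16000.2 Hz instead of the nominal 16000 Hz
# (a hypothetical deviation of 12.5 ppm; real devices differ)
nominal, actual = 16000.0, 16000.2

minutes = 10
drift_samples = minutes * 60 * (actual - nominal)   # extra samples accumulated
drift_ms = 1000 * drift_samples / nominal           # misalignment in milliseconds
# After 10 minutes the two streams are about 120 samples (7.5 ms) apart,
# far more than the sub-millisecond alignment beamforming typically needs.
```

Because the deviation is unknown and may vary over time, a one-off offset correction is not enough; the drift has to be estimated and compensated continuously, which is why the synchronization problem is treated as a task in its own right.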

Another common aspect regards the deep learning framework that will be used. All the groups will mainly rely on the recently-released PyTorch-Kaldi toolkit (https://github.com/mravanelli/pytorch-kaldi), which will be progressively updated before and during the workshop to better address the challenging tasks proposed in this project.

In conclusion, the research groups will be constantly interacting with one another throughout the workshop. Meetings and seminars will be organized every week in order to analyze the progress of each team and reschedule activities when necessary.

At the beginning of the workshop, each undergraduate student will join one of these groups, based on her/his background and software skills, and will be assigned a set of tasks to run together with other group members.

 

 

Team Leader

Maurizio Omologo (FBK, Italy)

Senior Members

Mirco Ravanelli (MILA, Canada)

Alessio Brutti (FBK, Italy)

Pawel Swietojanski (University of New South Wales, Australia)

Graduate Students

Santiago Pascual de la Puente (UPC, Barcelona, Spain)

Tobias Menne (Aachen University, Germany)

Sunit Sivasankaran (INRIA Nancy, France)

Manuel Pariente (Université de Lorraine, France)

Tina Raissi (RWTH Aachen University, Germany)

Samuele Cornell (Polytechnic University of Marche, Italy)

João Monteiro (INRS-Université du Quebec, Canada)

Undergraduate Students

Jianyuan Zhong (University of Rochester)

Yue Yin (Carnegie Mellon University)

Senior Affiliates (Part-time Members)

Stefano Squartini (Polytechnic University of Marche, Italy)

Emmanuel Vincent (INRIA Nancy – Grand Est, France)

Jan Trmal (Johns Hopkins University)

 

 

Johns Hopkins University, Whiting School of Engineering

Center for Language and Speech Processing
Hackerman 226
3400 North Charles Street, Baltimore, MD 21218-2680