Distant supervision for representation learning in speech and handwriting

2019 Sixth Frederick Jelinek Memorial Summer Workshop

Large labeled data sets are a prerequisite for the successful application of AI models. However, target labels are often expensive and difficult to obtain, while unlabeled data is abundant. Partial annotations or metadata are often readily available; although they are not sufficient for fully supervised learning, they nonetheless provide valuable information that can supervise downstream tasks from a distance.

We propose to work on learning representations for speech and handwriting recognition that separate the relevant content from other properties, such as speaker/scribe, historical period and style, or noise/background. The rationale is twofold. First, many important tasks will benefit from better representations, among them speech recognition for low-resource languages and universal access to old handwritten documents. Second, the proposal tackles the important open research question of understanding which properties of a learning algorithm give rise to disentangled representations.

During the workshop, we will apply sequence-to-sequence autoencoders with autoregressive (language-model-like) decoders to two problem domains: speech (sequences of samples or feature frames) and handwriting (sequences of baseline-aligned images). We will train the autoencoders to separate the information contained in the sequences into (1) information that is easy to infer autoregressively from the recent past, (2) global conditioning on, e.g., speaker or scribe, and (3) language-like information that reflects the content of the speech or document. To enforce this separation we will use variational autoencoding techniques [1], adversarial domain adaptation [2], probability distribution matching [3], and HMM priors [4].

The models will be evaluated in two ways:

  1. On downstream tasks, such as keyword spotting or speech/handwriting recognition. Using these tasks, we will establish the data efficiency of unsupervised and supervised methods, assessing, for each technique, the equivalent amount of labeled data that would yield similar model performance.
  2. On special tasks, such as ABX comparisons [5], developed to quantify the quality of disentanglement of information in the latent representation. These tasks will help us understand which design choices affect the properties of latent representations.

In preparation for the workshop, we will establish datasets and prepare baseline solutions. During the workshop, a single codebase will be developed for both speech and handwriting. We will try several distant supervision approaches and, as a concurrent task, design the full evaluation suite: (semi-)supervised solutions to the targeted problems (e.g. [6] for handwriting) and specialized tests on real and synthetic data.
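As an illustration of the ABX comparisons mentioned among the evaluation tasks, the following is a minimal sketch, not the evaluation code of the workshop. It assumes each item has already been encoded as a fixed-size vector and uses Euclidean distance; real speech evaluations typically compare variable-length frame sequences with an alignment-based distance. The function names `euclidean` and `abx_error_rate` are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def abx_error_rate(items_a, items_b, distance=euclidean):
    """Fraction of (a, b, x) triples in which x is not closer to a than to b.

    items_a, items_b: lists of vectors from category A and category B.
    x is drawn from category A and is always a different item than a.
    Ties count as half an error. This is the one-directional score;
    a symmetric score would average it with abx_error_rate(items_b, items_a).
    """
    errors = 0.0
    trials = 0
    for i, a in enumerate(items_a):
        for j, x in enumerate(items_a):
            if i == j:
                continue  # x must be a different token of the same category
            d_ax = distance(a, x)
            for b in items_b:
                d_bx = distance(b, x)
                if d_ax > d_bx:
                    errors += 1.0
                elif d_ax == d_bx:
                    errors += 0.5
                trials += 1
    if trials == 0:
        raise ValueError("need at least two items in A and one item in B")
    return errors / trials


if __name__ == "__main__":
    # Toy latent vectors: category A clusters near the origin, B far away.
    cat_a = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2]]
    cat_b = [[5.0, 5.0], [5.5, 4.5]]
    print(abx_error_rate(cat_a, cat_b))
```

A low error rate on, say, phoneme categories combined with a near-chance error rate on speaker categories would suggest that the representation keeps content and discards speaker identity, which is the kind of disentanglement the special tasks aim to measure.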

The proposal brings together a team with complementary skills in deep learning, low-resource language processing, and old-document recognition, which will ensure a successful and fruitful collaboration.

Team Leader

Jan Chorowski (University of Wroclaw, Poland)

Senior Members

Ricard Marxer (University of Toulon, France)

Antoine Laurent (LIUM Universite Le Mans, France)

Hans Dolfing (Independent Researcher)

Graduate Students

Salima Mdhaffar (LIUM Universite Le Mans, France)

Guillaume Sanchez (University of Toulon)

Nanxin Chen (JHU)

Sameer Khurana (MIT)

Adrian Łańcucki (University of Wroclaw)

Senior Affiliates (Part-time Members)

Jerome Bellegarda (Apple)

Tanel Alumäe (Tallinn University of Technology)

Johns Hopkins University

Johns Hopkins University, Whiting School of Engineering

Center for Language and Speech Processing
Hackerman 226
3400 North Charles Street, Baltimore, MD 21218-2680