Large labeled datasets are a prerequisite for the successful application of AI models. However, target labels are often expensive and difficult to obtain, while unlabeled data is abundant. Partial annotations or metadata are often readily available, and while they are not sufficient for supervised learning approaches, they nonetheless provide valuable information for distant supervision of downstream tasks.
We propose to learn representations for speech and handwriting recognition that separate the relevant content from other properties, such as speaker/scribe identity, historical period and style, or noise/background. The rationale is twofold. First, many important tasks will benefit from better representations, among them speech recognition for low-resource languages and universal accessibility of old handwritten documents. Second, the proposal tackles the important open research question of which properties of a learning algorithm give rise to disentangled representations.
During the workshop, we will apply sequence-to-sequence autoencoders with autoregressive (language-model-like) decoders to two problem domains: speech (sequences of samples or feature frames) and handwriting (sequences of baseline-aligned images). We will train the autoencoders to separate the information contained in the sequences into (1) information that is easy to infer autoregressively from the recent past, (2) global conditioning on, e.g., speaker or scribe, and (3) language-like information that reflects the content of the speech or document. To enforce this separation we will use variational autoencoding techniques, adversarial domain adaptation, probability distribution matching, and HMM priors.
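To make the three-way factorization concrete, the skeleton of such a model can be sketched in PyTorch. This is a minimal illustrative sketch, not the proposed system: all names, dimensions, and the GRU-based architecture are our assumptions, and the variational, adversarial, and HMM-prior terms are omitted. It only shows how an autoregressive decoder can be conditioned jointly on a per-sequence global code (speaker/scribe) and a per-timestep content code.

```python
# Illustrative sketch of a factorized sequence autoencoder (assumed
# architecture, not the actual proposed model).
import torch
import torch.nn as nn

class FactorizedSeqAutoencoder(nn.Module):
    def __init__(self, feat_dim=40, content_dim=16, global_dim=8, hidden=64):
        super().__init__()
        # (2) global conditioning: one vector per sequence (speaker/scribe)
        self.global_enc = nn.GRU(feat_dim, global_dim, batch_first=True)
        # (3) language-like content: one code per timestep
        self.content_enc = nn.GRU(feat_dim, content_dim, batch_first=True)
        # (1) autoregressive decoder: predicts frame t from frame t-1,
        #     conditioned on the content and global codes
        self.decoder = nn.GRU(feat_dim + content_dim + global_dim, hidden,
                              batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, x):
        # x: (batch, time, feat_dim), e.g. speech feature frames
        _, g = self.global_enc(x)                 # (1, batch, global_dim)
        g_seq = g.transpose(0, 1).expand(-1, x.size(1), -1)
        z, _ = self.content_enc(x)                # per-timestep content codes
        # teacher forcing: shift the input right by one frame
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        h, _ = self.decoder(torch.cat([prev, z, g_seq], dim=-1))
        return self.out(h)

model = FactorizedSeqAutoencoder()
x = torch.randn(2, 50, 40)                        # two 50-frame utterances
recon = model(x)                                  # shape (2, 50, 40)
loss = nn.functional.mse_loss(recon, x)
```

In the full proposal, the separation between the three pathways would additionally be enforced by the variational, adversarial, and prior-matching objectives listed above; the reconstruction loss alone does not guarantee disentanglement.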
The models will be evaluated in two ways:
The proposal brings together a team with complementary skills in deep learning, low-resource language processing, and old-document recognition, which will ensure a successful and fruitful collaboration.