Leonardo Badino (Italian Institute of Technology) – Speech Production Features for Deep Neural Network Acoustic Modeling

When:
November 17, 2015 @ 12:00 pm – 1:15 pm
Where:
Hackerman B17
3400 N Charles St
Baltimore, MD 21218
USA
Abstract
In the last few years, deep neural networks (DNNs) have become the dominant technique for acoustic modeling in automatic speech recognition (ASR). The diverse set of approaches proposed to further improve ASR performance includes DNN-based acoustic modeling that uses speech production knowledge (SPK), i.e., information about how the vocal tract produces speech sounds.
While standard acoustic modeling already relies on some binary phonological SPK features (e.g., fricative) to model phonetic context and define the DNN targets, more explicit uses of SPK for DNN acoustic model training can be explored.
In this talk I will present two SPK-based approaches. The first relies on measurements of vocal tract movements to extract new acoustic features that are appended to the DNN input vector. The second extracts continuous-valued SPK features from binary phonological features, which are then used to build a structured output for the DNN. Both approaches, tested on the mngu0 and TIMIT datasets, show a consistent reduction in phone recognition error over a baseline that does not use SPK.
Biography

Leonardo Badino is a postdoctoral researcher at the Italian Institute of Technology (IIT), Genova, Italy. He received an M.Sc. degree in Electronic Engineering from the University of Genova in 2000 and a PhD in Computer Science from the University of Edinburgh, UK, in 2010. From 2001 to 2006 he worked as a software engineer and project manager at Loquendo, a speech technology company. During his PhD he worked on prosodic prominence detection and generation for text-to-speech synthesis. He is currently working on speech production knowledge for ASR, limited-resource ASR, and computational analysis of non-verbal sensorimotor communication.

Center for Language and Speech Processing