Vocal Aging Explained by Vocal Tract Modeling
This project aims to explore vocal tract aging via vocal tract modeling. We will estimate vocal-tract configurations underlying speech recorded by people at different ages through analysis-by-synthesis based articulator inversion and attempt to estimate analytically characterized transformations that describe how these configurations change with age. The transformations will be corroborated, where possible, through information culled from existing medical and physiological studies of aging and the vocal tract. The estimated transformations can then be used to predict age-related changes in speech spectra of target speakers, or to detect medically relevant variations of speech signals from their expected patterns.
As a related component task, we will develop analysis-by-synthesis based features that may be used for speech recognition, as well as analysis-by-synthesis based statistical techniques to learn models that relate phonemes to articulator configurations.
After extensive preparation of data, tools, and team during the pre-workshop period, we will focus our 6-week workshop effort on accomplishing the following clearly measurable targets:
- Estimating phoneme-specific Vocal Tract Parameters by Analysis-by-Synthesis. For synthesis we will use existing models that simulate the physics of the vocal tract, in order to relate the latter to the observed signal. Specifically, for this procedure we will:
- Implement an array of vocal tract models, henceforth referred to as the "Vocal-Tract Model Array", that represents a number of valid vocal tract configurations obtained from actual measurements of the vocal tract.
- Synthesize speech from each vocal tract in this array using excitation derived from the incoming speech signal.
- Compare the synthesized speech to the incoming speech to generate a vector of distortion values every 10 ms. Each component of this vector will represent the error between the signal generated by one vocal tract in the vocal-tract model array and the incoming signal.
- Implement a dynamic programming technique that minimizes total distortion to obtain complete trajectories of vocal tract configurations in a manner that maintains consistency across different instances of any phoneme, thereby deriving phoneme-specific vocal-tract measurements (configurations and trajectories). We refer to these measurements as "VTM" below.
- Given a set of estimated vocal tract configurations V(age, phoneme) at different epochs (ages) for different phonemes for one or more speakers, derive regressions G(·) that relate vocal tract configurations to age:
V(age, phoneme) = G(age, phoneme) * V(age′, phoneme).
- Time permitting, we will also extend the above procedure to estimate tissue-related parameters, such as compliance, that might also be related to age.
- Proof of principle. Here we will establish that age-related patterns can indeed be discerned in the audio and in the corresponding vocal tract measurements for subjects. To do so we will:
- Obtain data-driven proof of age dependency of vocal tract anatomy using an MFCC-based age classifier.
- Establish that the VTMs reflect aging directly by comparing age-related variations in estimated parameters to those generally expected from medical and phenomenological studies.
- Provide visualization of the aging process in a distance-conserving low-dimensional representation of the estimated vocal tract parameters.
- Exploitation of Anatomical Model. Based on the outcome of (1) and (2), we will also address the following issues:
- Use of VTM vector stream to classify age and to analyse aging processes.
- Use of VTM vector stream to predict disease stages that affect speech production, by identifying speech recordings for which the estimated vocal-tract parameters differ significantly from those predicted by the transformation G(·) from other recordings of the speaker obtained at other times.
- Develop an "analysis-by-synthesis feature vector" from low-dimensional projections of the distortion vector that is computed during the analysis, for speech recognition.
- We will also develop an approach to use the estimated phoneme-specific VTMs for analysis-by-synthesis based speech recognition.
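The trajectory search described in the first target (a distortion vector per frame, followed by dynamic programming that minimizes total distortion) can be sketched as below. This is only an illustration under stated assumptions: the function name `best_trajectory` and the `switch_cost` matrix are hypothetical, and the pairwise switching penalty stands in for the phoneme-level consistency constraint, which the description above does not specify in detail.

```python
from typing import List, Sequence


def best_trajectory(
    distortion: Sequence[Sequence[float]],
    switch_cost: Sequence[Sequence[float]],
) -> List[int]:
    """Pick one vocal-tract model per frame with minimal total cost.

    distortion[t][k]: error between incoming frame t and the speech
        synthesized from model k of the vocal-tract model array.
    switch_cost[j][k]: assumed penalty for moving from model j to
        model k between consecutive frames (hypothetical smoothness term).
    """
    n_models = len(distortion[0])
    # cost[k]: minimal accumulated cost of any path ending in model k
    cost = list(distortion[0])
    back: List[List[int]] = []  # back[t-1][k]: best predecessor of k at frame t

    for frame in distortion[1:]:
        new_cost = []
        pointers = []
        for k in range(n_models):
            j_best = min(range(n_models), key=lambda j: cost[j] + switch_cost[j][k])
            new_cost.append(cost[j_best] + switch_cost[j_best][k] + frame[k])
            pointers.append(j_best)
        back.append(pointers)
        cost = new_cost

    # Trace back from the cheapest final state
    k = min(range(n_models), key=lambda i: cost[i])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path


if __name__ == "__main__":
    d = [[0.1, 0.9], [0.6, 0.5], [0.2, 0.8]]  # 3 frames, 2 models (toy values)
    print(best_trajectory(d, [[0.0, 0.5], [0.5, 0.0]]))
    print(best_trajectory(d, [[0.0, 0.0], [0.0, 0.0]]))
```

With a zero switching penalty the search reduces to a frame-by-frame minimum, while a positive penalty favors smoother trajectories. How strongly to penalize switches is a design choice the description above leaves open.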
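The age regression V(age, phoneme) = G(age, phoneme) * V(age′, phoneme) and its use for flagging recordings that deviate from the prediction might look as follows in the simplest case. The sketch assumes a fixed pair of ages, a single phoneme, and a diagonal G (one scale factor per vocal-tract parameter). None of these restrictions come from the project description, and the names `fit_diagonal_g` and `deviation` are hypothetical.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def fit_diagonal_g(pairs: Sequence[Tuple[Vector, Vector]]) -> List[float]:
    """Least-squares scale factors g_k so that v_target[k] ~ g_k * v_ref[k].

    pairs: (v_ref, v_target) vocal-tract parameter vectors for one phoneme,
        measured at the reference age and at the target age.
    Assumes each reference parameter is nonzero in at least one pair.
    """
    dim = len(pairs[0][0])
    g = []
    for k in range(dim):
        num = sum(ref[k] * tgt[k] for ref, tgt in pairs)
        den = sum(ref[k] * ref[k] for ref, _ in pairs)
        g.append(num / den)
    return g


def deviation(g: Vector, v_ref: Vector, v_observed: Vector) -> float:
    """Euclidean distance between observed parameters and the prediction g * v_ref."""
    return sqrt(sum((obs - gk * ref) ** 2 for gk, ref, obs in zip(g, v_ref, v_observed)))


if __name__ == "__main__":
    # Toy data: parameter 0 doubles with age, parameter 1 halves
    training = [([1.0, 2.0], [2.0, 1.0]), ([2.0, 4.0], [4.0, 2.0])]
    g = fit_diagonal_g(training)
    print(g)
    print(deviation(g, [1.0, 2.0], [2.0, 1.0]))  # matches the prediction
    print(deviation(g, [1.0, 2.0], [2.0, 4.0]))  # large deviation from the prediction
```

A recording whose deviation is large relative to the spread seen in training data would be a candidate for the "differs significantly" case in the third target. The threshold, and whether a full matrix G is needed, would have to be settled empirically.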
Team Members
| Role | Name | Email | Affiliation |
| --- | --- | --- | --- |
| Team Leader | Elmar Noeth | noeth at informatik dot uni-erlangen dot de | University of Erlangen |
| Senior Personnel | Peter Beyerlein | peter dot beyerlein at tfh-wildau dot de | University of Applied Sciences Wildau |
| Senior Personnel | Georg Stemmer | georg dot stemmer at siemens dot com | Siemens AG |
| Graduate Student | Andrew Cassidy | acassidy at jhu dot edu | Johns Hopkins University |
| Graduate Student | Eva Lasarcyk | evaly at coli dot uni-saarland dot de | Saarland University |
| Graduate Student | Blaise Potard | v1bpotar at inf dot ed dot ac dot uk | LORIA |
| Graduate Student | Werner Spiegel | spiegl at immd5 dot informatik dot uni-erlangen dot de | University of Erlangen |
| Graduate Student | Puyang Xu | puyangxu at gmail dot com | Johns Hopkins University |
| Undergraduate Student | Young Chol Song | nskystars at gmail dot com | Stony Brook University |
| Undergraduate Student | Stephen Shum | sshum at berkeley dot edu | University of California, Berkeley |
| Undergraduate Student | Varada Kolhatkar | kolha002 at d dot umn dot edu | University of Minnesota, Duluth |
| Affiliate | Andreas Andreou | andreou at jhu dot edu | Johns Hopkins University |
| Affiliate | Nemala Sridhar Krishna | siris dot krishna at gmail dot com | Johns Hopkins University |


