The Center for Language and Speech Processing





Workshops

Vocal Aging Explained by Vocal Tract Modeling

This project aims to explore vocal tract aging through vocal tract modeling. We will estimate the vocal-tract configurations underlying speech recorded by people at different ages through analysis-by-synthesis based articulator inversion, and attempt to estimate analytically characterized transformations that describe how these configurations change with age. Where possible, the transformations will be corroborated with information drawn from existing medical and physiological studies of aging and the vocal tract. The estimated transformations can then be used to predict age-related changes in the speech spectra of target speakers, or to detect medically relevant deviations of speech signals from their expected patterns.

As a related component task, we will develop analysis-by-synthesis based features that may be used for speech recognition, as well as analysis-by-synthesis based statistical techniques to learn models that relate phonemes to articulator configurations.

After extensive preparation of data, tools, and team during the pre-workshop period, we will focus our 6-week workshop effort on accomplishing the following clearly measurable targets:

  1. Estimating phoneme-specific Vocal Tract Parameters by Analysis-by-Synthesis. For synthesis we will use existing models that simulate the physics of the vocal tract, in order to relate the vocal tract to the observed signal. Specifically, for this procedure we will:
    • Implement an array of vocal tract models, henceforth referred to as the "Vocal-Tract Model Array", that represents a number of valid vocal tract configurations obtained from actual measurements of the vocal tract.
    • Synthesize speech from each vocal tract in this array using excitation derived from the incoming speech signal.
    • Compare the synthesized speech to the incoming speech to generate a vector of distortion values every 10 ms. Each component of this vector will represent the error between the signal generated by one vocal tract in the vocal-tract model array and the incoming signal.
    • Implement a dynamic programming technique that minimizes total distortion to obtain complete trajectories of vocal tract configurations in a manner that maintains consistency across different instances of any phoneme, thereby deriving phoneme-specific vocal-tract measurements (configurations and trajectories). We refer to these measurements as "VTM" below.
    • Given a set of estimated vocal tract configurations V(age, phoneme) at different epochs (ages) for different phonemes for one or more speakers, derive regressions G(·) that relate vocal tract configurations to age:

      V(age, phoneme) = G(age, phoneme) * V(age′, phoneme).
    • Time permitting, we will also extend the above procedure to estimate tissue-related parameters, such as compliance, that might also be related to age.
  2. Proof of principle. Here we will establish that age-related patterns can indeed be discerned in the audio and in the corresponding vocal tract measurements for subjects. To do so we will:
    • Obtain data-driven proof of age dependency of vocal tract anatomy using an MFCC-based age classifier.
    • Establish that the VTMs reflect aging directly by comparing age-related variations in estimated parameters to those generally expected from medical and phenomenological studies.
    • Provide visualization of the aging process in a distance-conserving low-dimensional representation of the estimated vocal tract parameters.
  3. Exploitation of Anatomical Model. Based on the outcome of (1) and (2), we will also address the following issues:
    • Use of VTM vector stream to classify age and to analyse aging processes.
    • Use of VTM vector stream to predict disease stages that affect speech production, by identifying speech recordings for which the estimated vocal-tract parameters differ significantly from those predicted by the transformation G(·) from other recordings of the speaker obtained at other times.
    • Develop an "analysis-by-synthesis feature vector" from low-dimensional projections of the distortion vector that is computed during the analysis, for speech recognition.
    • We will also develop an approach to use the estimated phoneme-specific VTMs for analysis-by-synthesis based speech recognition.
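
The dynamic programming step in target (1) can be illustrated with a minimal sketch. It assumes that each 10 ms frame already has a distortion vector with one entry per model in the Vocal-Tract Model Array, and it picks one model per frame so that the summed distortion plus a penalty for changing model is as small as possible. The function name `best_trajectory`, the constant `switch_cost` penalty, and the toy numbers are hypothetical; the project description does not specify how consistency across phoneme instances is enforced, so the penalty is only a stand-in for that constraint.

```python
from typing import List, Sequence


def best_trajectory(distortion: Sequence[Sequence[float]],
                    switch_cost: float = 1.0) -> List[int]:
    """Pick one model index per frame, minimizing the sum of per-frame
    distortions plus `switch_cost` for every change of model.

    distortion[t][m] is the error of model m at frame t (assumed given).
    The constant switch penalty is an illustrative assumption.
    """
    n_frames = len(distortion)
    if n_frames == 0:
        return []
    n_models = len(distortion[0])

    # cost[m]: cheapest total cost of any path that ends in model m
    cost = list(distortion[0])
    back = []  # back[t - 1][m]: best predecessor of model m at frame t

    for t in range(1, n_frames):
        # With a constant penalty, the best predecessor is either the same
        # model (no penalty) or the globally cheapest model (plus penalty).
        best_prev = min(range(n_models), key=lambda m: cost[m])
        new_cost, pointers = [], []
        for m in range(n_models):
            stay = cost[m]
            switch = cost[best_prev] + switch_cost
            if stay <= switch:
                new_cost.append(stay + distortion[t][m])
                pointers.append(m)
            else:
                new_cost.append(switch + distortion[t][m])
                pointers.append(best_prev)
        cost = new_cost
        back.append(pointers)

    # Trace back from the cheapest final model.
    m = min(range(n_models), key=lambda k: cost[k])
    path = [m]
    for pointers in reversed(back):
        m = pointers[m]
        path.append(m)
    path.reverse()
    return path


# Toy example: 3 frames, 2 candidate vocal tract models.
frames = [[0.1, 0.9], [0.6, 0.5], [0.2, 0.8]]
smooth_path = best_trajectory(frames, switch_cost=1.0)
greedy_path = best_trajectory(frames, switch_cost=0.0)
```

With a penalty of zero the result reduces to the per-frame best match, while a larger penalty favors a trajectory that stays on one configuration even when a neighboring model fits a single frame slightly better.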

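The regression V(age, phoneme) = G(age, phoneme) * V(age′, phoneme) and its use for flagging unexpected recordings can be sketched in the same spirit. The sketch below assumes the simplest possible form of G: a diagonal matrix, fitted by least squares for one phoneme and one fixed pair of ages, so that each vocal tract parameter is scaled independently. The helper names `fit_diagonal_g` and `relative_deviation` and all numbers are hypothetical, and the actual form of G is not specified in the project description.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def fit_diagonal_g(pairs: Sequence[Tuple[Vector, Vector]]) -> List[float]:
    """Least-squares scale g[k] such that later[k] ~= g[k] * earlier[k].

    Each pair holds (configuration at age', configuration at age) for the
    same phoneme. A diagonal G is an illustrative assumption.
    """
    dim = len(pairs[0][0])
    g = []
    for k in range(dim):
        num = sum(earlier[k] * later[k] for earlier, later in pairs)
        den = sum(earlier[k] ** 2 for earlier, _ in pairs)
        g.append(num / den)
    return g


def relative_deviation(g: Vector, earlier: Vector, observed: Vector) -> float:
    """Distance between the observed configuration and the one predicted
    from the earlier recording, relative to the size of the prediction."""
    predicted = [gk * x for gk, x in zip(g, earlier)]
    err = sum((o - p) ** 2 for o, p in zip(observed, predicted)) ** 0.5
    norm = sum(p * p for p in predicted) ** 0.5
    return err / norm


# Toy training data: two speakers, two vocal tract parameters each.
training = [([1.0, 2.0], [2.0, 1.0]), ([2.0, 4.0], [4.0, 2.0])]
g = fit_diagonal_g(training)

expected = relative_deviation(g, [3.0, 6.0], [6.0, 3.0])  # follows the trend
unusual = relative_deviation(g, [3.0, 6.0], [6.0, 6.0])   # departs from it
```

A recording whose deviation is well above that of typical recordings would be a candidate for the medically relevant variations mentioned in target (3); where to set such a threshold is left open here.
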
Team Members

Team Leader
    Elmar Noeth, noeth at informatik dot uni-erlangen dot de, University of Erlangen
Senior Personnel
    Peter Beyerlein, peter dot beyerlein at tfh-wildau dot de, University of Applied Sciences Wildau
    Georg Stemmer, georg dot stemmer at siemens dot com, Siemens AG
Graduate Students
    Andrew Cassidy, acassidy at jhu dot edu, Johns Hopkins University
    Eva Lasarcyk, evaly at coli dot uni-saarland dot de, Saarland University
    Blaise Potard, v1bpotar at inf dot ed dot ac dot uk, LORIA
    Werner Spiegel, spiegl at immd5 dot informatik dot uni-erlangen dot de, University of Erlangen
    Puyang Xu, puyangxu at gmail dot com, Johns Hopkins University
Undergraduate Students
    Young Chol Song, nskystars at gmail dot com, Stony Brook University
    Stephen Shum, sshum at berkeley dot edu, University of California, Berkeley
    Varada Kolhatkar, kolha002 at d dot umn dot edu, University of Minnesota, Duluth
Affiliates
    Andreas Andreou, andreou at jhu dot edu, Johns Hopkins University
    Nemala Sridhar Krishna, siris dot krishna at gmail dot com, Johns Hopkins University