Universal Speech Content Factorization – Henry Li Xinyuan (JHU)

When:
March 13, 2026 @ 12:00 pm – 1:15 pm (UTC-04:00)
Where:
Hackerman Hall B17
Cost:
Free

Abstract

We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization (SCF), a closed-set voice conversion method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot voice conversion system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that USCF features can serve as an alternative acoustic representation for text-to-speech, offering a linear, training-efficient substitute for timbre-prompted SSL-based systems.

Also available by Zoom: https://wse.zoom.us/j/96735183473

Center for Language and Speech Processing