Leveraging Large Speech Language Models as Evaluators for Expressive Speech – Bismarck Odoom (JHU)
Abstract
Expressive speech generation aims to produce speech that conveys not only linguistic content but also nuanced emotional and stylistic information. However, evaluating the expressiveness of generated speech remains challenging and often relies on expensive human listening tests. We propose using large speech language models (SLMs), trained on speech and text data, as automatic evaluators for various aspects of expressive speech, such as emotion, gender, emotional intensity, valence, dominance, arousal, accent, and speaking rate. We leverage the speech perception and understanding capabilities of existing large SLMs and fine-tune them to produce natural language evaluations of expressive attributes in speech, providing a scalable alternative to traditional evaluation methods.
Bio
Bismarck Odoom is a fourth-year CS PhD student at CLSP at Johns Hopkins University, advised by Philipp Koehn. His primary research interests are speech translation and multimodal LLMs.
Also Available by Zoom: https://wse.zoom.us/j/96735183473