Helin Wang (JHU) “SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer”

When:
September 27, 2024 @ 12:00 pm – 1:15 pm
Where:
Hackerman Hall B17
3400 N CHARLES ST
Baltimore
MD 21218
Cost:
Free

Abstract

SoloAudio is a diffusion-based generative model for target sound extraction (TSE). It trains latent diffusion models on audio, replacing the U-Net backbone used in previous work with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by using the CLAP model as a feature extractor for target sounds. It is also trained on synthetic audio generated by state-of-the-art text-to-audio models, and it generalizes strongly to out-of-domain data and novel sound events. SoloAudio further demonstrates zero-shot and few-shot learning capabilities.
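
For readers unfamiliar with the term, the wiring of a skip-connected Transformer can be sketched roughly as follows. This is only an illustration and not SoloAudio's implementation: it assumes U-Net-style long skips that pair early blocks with late blocks, it merges each skip by simple addition, and the blocks are toy functions on plain lists rather than attention layers. The name `run_skip_connected` and all parameters are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def run_skip_connected(blocks: List[Block], mid: Block, x: Vector) -> Vector:
    """Run blocks as an 'in' half, a middle block, and an 'out' half.

    Assumption for illustration: the output of the i-th 'in' block is
    added to the input of the matching 'out' block (last in, first out).
    """
    if len(blocks) % 2 != 0:
        raise ValueError("expected an even number of blocks")
    half = len(blocks) // 2

    skips: List[Vector] = []
    for block in blocks[:half]:
        x = block(x)
        skips.append(x)  # remember this output for the matching 'out' block

    x = mid(x)

    for block in blocks[half:]:
        skip = skips.pop()  # most recent 'in' output pairs with the first 'out' block
        x = block([a + b for a, b in zip(x, skip)])
    return x


def add_one(v: Vector) -> Vector:
    """Toy stand-in for a Transformer block."""
    return [a + 1.0 for a in v]


def identity(v: Vector) -> Vector:
    """Toy stand-in for the middle block."""
    return list(v)


result = run_skip_connected([add_one] * 4, identity, [0.0])
```

In a real model each block would be a Transformer layer over latent audio features, and the skip would more likely be merged by concatenation followed by a learned projection than by plain addition.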

Center for Language and Speech Processing