Helin Wang (JHU) “SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer”
Abstract: SoloAudio is an innovative diffusion-based generative model designed for target sound extraction (TSE). It trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected Transformer that operates on latent features.[…]
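As a rough illustration of what "skip-connected" can mean for a Transformer stack, the sketch below wires long skip connections from early blocks to late blocks over a latent vector. It is not taken from the talk: the abstract does not specify the wiring, so the mirrored (first-to-last) pairing is an assumption, and `make_toy_block`, `average_merge`, and `skip_connected_stack` are hypothetical names. The toy blocks stand in for real attention/MLP blocks, and averaging stands in for whatever learned merge a real model would use.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_toy_block(scale: float) -> Block:
    """Hypothetical stand-in for a Transformer block: a residual elementwise update."""
    def block(x: Vector) -> Vector:
        return [v + scale * v for v in x]
    return block


def average_merge(deep: Vector, shallow: Vector) -> Vector:
    """Toy merge of a deep feature with a skipped shallow feature (assumption:
    a real model would more likely use a learned projection here)."""
    return [(d + s) / 2 for d, s in zip(deep, shallow)]


def skip_connected_stack(
    x: Vector,
    in_blocks: List[Block],
    mid_block: Block,
    out_blocks: List[Block],
    merge: Callable[[Vector, Vector], Vector],
) -> Vector:
    """Run the blocks, feeding each early block's output to its mirrored late block."""
    if len(in_blocks) != len(out_blocks):
        raise ValueError("in_blocks and out_blocks must have the same length")
    skips: List[Vector] = []
    for block in in_blocks:
        x = block(x)
        skips.append(x)  # remember the shallow feature for later
    x = mid_block(x)
    for block in out_blocks:
        x = merge(x, skips.pop())  # last saved skip pairs with the first late block
        x = block(x)
    return x


if __name__ == "__main__":
    double = make_toy_block(1.0)    # multiplies each latent value by 2
    identity = make_toy_block(0.0)  # leaves the latent unchanged
    latent = [1.0, 2.0]
    print(skip_connected_stack(latent, [double, double], identity,
                               [identity, identity], average_merge))
    # prints [3.0, 6.0]
```

The stack makes the pairing explicit: the deepest early feature is merged first and the shallowest last, which is the U-Net-like pattern carried over to a sequence of Transformer blocks.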