Abstract: Several attempts have been made to handle multiple source separation tasks such as speech enhancement, speech separation, sound event separation, music source separation (MSS), or cinematic audio source separation (CASS) with a single model. These models are trained on large-scale data including speech, instruments, or sound events and can often successfully separate a wide range of sources. However, it is still challenging for such models to cover all separation tasks because some of them are contradictory (e.g., musical instruments are separated in MSS while they have to be grouped in CASS). To overcome this issue and support all the major separation tasks, we propose a task-aware unified source separation (TUSS) model. The model uses a variable number of learnable prompts to specify which source to separate, and changes its behavior depending on the given prompts, enabling it to handle all the major separation tasks including contradictory ones. Experimental results demonstrate that the proposed TUSS model successfully handles the five major separation tasks mentioned earlier. We also provide some audio examples, including both synthetic mixtures and real recordings, to demonstrate how flexibly the TUSS model changes its behavior at inference depending on the prompts.

What problem does this paper attempt to address?

This paper addresses how to build a single unified model that handles all the major audio source separation tasks. Existing multi-task separation models perform well across several tasks, but they struggle when tasks have contradictory goals. For example, music source separation (MSS) requires the instruments to be separated from one another, whereas cinematic audio source separation (CASS) requires them to be grouped together. A unified model therefore has to change what it outputs depending on which task is requested.

To solve this problem, the authors propose the **Task-Aware Unified Source Separation (TUSS) model**. The model uses learnable prompts to specify the sources to be separated and adjusts its behavior according to the prompts it is given. Its key features are:

1. **Variable number of prompts**: The model accepts any number of prompts, so it can produce a different number of output sources for each request.
2. **Conditional separation module**: A conditional TSE module extracts the specific source that each prompt designates.
3. **Cross-prompt module**: This module jointly models the encoded features and the prompts, so that each prompt is conditioned on both the mixture and the other prompts.

The experimental results show that the TUSS model successfully handles five major source separation tasks (speech enhancement, speech separation, general sound separation, music source separation, and cinematic audio source separation) and lets the user control the output at inference time by choosing the prompts. The paper also demonstrates that the model behaves flexibly when given prompt combinations not seen during training, which suggests its potential in practical applications.
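The interaction between the three features described above can be illustrated with a toy sketch. This is not the authors' implementation: the prompt names, the feature dimension, the single attention step without learned projections, and the sigmoid gating are all assumptions made for illustration, and the prompt vectors are random rather than trained. The sketch only shows how a variable-length prompt list determines the number of outputs, how prompts and mixture frames can be mixed in one joint sequence, and how each prompt then conditions its own output.

```python
import math
import random

D = 4  # feature dimension (illustrative assumption)

def make_prompts(names, dim=D, seed=0):
    # One vector per prompt name; random here, learnable in a real model.
    rng = random.Random(seed)
    return {n: [rng.gauss(0.0, 1.0) for _ in range(dim)] for n in names}

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_prompt(prompt_vecs, frames):
    # Toy stand-in for a cross-prompt module: one self-attention step over
    # the joint sequence [prompts; frames], so every prompt sees the mixture
    # and the other prompts. No learned projections are used.
    seq = prompt_vecs + frames
    scale = math.sqrt(len(seq[0]))
    out = []
    for q in seq:
        w = softmax([dot(q, k) / scale for k in seq])
        out.append([sum(wi * v[d] for wi, v in zip(w, seq))
                    for d in range(len(q))])
    n = len(prompt_vecs)
    return out[:n], out[n:]

def conditional_extract(prompt_vec, frames):
    # Toy stand-in for conditional extraction: the prompt gates each
    # feature channel of every frame through a sigmoid.
    gate = [1.0 / (1.0 + math.exp(-p)) for p in prompt_vec]
    return [[g * x for g, x in zip(gate, f)] for f in frames]

def separate(frames, prompt_names, table):
    # One output per prompt; the prompt list may have any length.
    prompts = [table[n] for n in prompt_names]
    prompts, frames = cross_prompt(prompts, frames)
    return [conditional_extract(p, frames) for p in prompts]

# Hypothetical prompt vocabulary and a dummy 5-frame "mixture".
table = make_prompts(["speech", "music-mix", "sfx-mix",
                      "drums", "bass", "vocals", "other"])
mixture = [[0.1 * (t + d) for d in range(D)] for t in range(5)]

cass = separate(mixture, ["speech", "music-mix", "sfx-mix"], table)
mss = separate(mixture, ["drums", "bass", "vocals", "other"], table)
print(len(cass), len(mss))  # 3 4
```

The same mixture yields three outputs for the CASS-style prompt set and four for the MSS-style set, which mirrors how the prompts, rather than a fixed output layer, decide what is separated and what is grouped.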

Task-Aware Unified Source Separation

Universal Source Separation with Weakly Labelled Data

Audio Prompt Tuning for Universal Sound Separation

Separate Anything You Describe

A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction

A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis

Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation

Multi-Task Learning for Blind Source Separation

SADDEL: Joint Speech Separation and Denoising Model based on Multitask Learning

Sampling-Frequency-Independent Universal Sound Separation

Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks

GASS: Generalizing Audio Source Separation with Large-scale Data

Language-Guided Audio-Visual Source Separation via Trimodal Consistency

Music Source Separation in the Waveform Domain

Acoustic-Scene-Aware Target Sound Separation With Sound Embedding Refinement

Audio query-based music source separation

Video-Guided Sound Source Separation

What's All the FUSS About Free Universal Sound Separation Data?

Multi-Source Diffusion Models for Simultaneous Music Generation and Separation

Universal Sound Separation with Self-Supervised Audio Masked Autoencoder

Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation