Task-Aware Unified Source Separation

Kohei Saijo, Janek Ebbers, François G. Germain, Gordon Wichern, Jonathan Le Roux
2024-10-31
Abstract: Several attempts have been made to handle multiple source separation tasks such as speech enhancement, speech separation, sound event separation, music source separation (MSS), or cinematic audio source separation (CASS) with a single model. These models are trained on large-scale data including speech, instruments, or sound events and can often successfully separate a wide range of sources. However, it is still challenging for such models to cover all separation tasks because some of them are contradictory (e.g., musical instruments are separated in MSS while they have to be grouped in CASS). To overcome this issue and support all the major separation tasks, we propose a task-aware unified source separation (TUSS) model. The model uses a variable number of learnable prompts to specify which source to separate, and changes its behavior depending on the given prompts, enabling it to handle all the major separation tasks including contradictory ones. Experimental results demonstrate that the proposed TUSS model successfully handles the five major separation tasks mentioned earlier. We also provide some audio examples, including both synthetic mixtures and real recordings, to demonstrate how flexibly the TUSS model changes its behavior at inference depending on the prompts.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
This paper addresses how to build a single model that handles all major audio source separation tasks. Existing multi-task separation models perform well on several tasks, but they struggle when tasks have contradictory goals. For example, music source separation (MSS) requires the instruments to be separated from one another, while cinematic audio source separation (CASS) requires them to be grouped into a single music stem. A unified model therefore has to change its behavior depending on which separation is requested.

To solve this problem, the authors propose a **Task-Aware Unified Source Separation (TUSS) model**. The model uses learnable prompts to specify the sources to be separated and adjusts its behavior according to the prompts it is given. Its key features are:

1. **Variable number of prompts**: The model accepts a variable number of prompts, so it can produce a different number of output sources for each task.
2. **Conditional separation module**: The conditional target source extraction (TSE) module extracts the source specified by each prompt.
3. **Cross-prompt module**: The cross-prompt module jointly models the encoded features and the prompts, so that each prompt is conditioned on the mixture and on the other prompts.

The experimental results show that the TUSS model successfully handles five major source separation tasks (speech enhancement, speech separation, sound event separation, music source separation, and cinematic audio source separation) and lets the user control the output at inference by choosing the prompts. The paper also demonstrates the model's behavior on prompt combinations not seen during training, which suggests its potential in practical applications.
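The prompt mechanism described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the prompt names, the feature dimension, the single parameter-free attention layer standing in for the cross-prompt module, and the element-wise product standing in for the conditional TSE module are all assumptions made for illustration. It only shows how a variable number of prompts can yield a variable number of outputs, and how each output can depend on the other prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # feature dimension (hypothetical)

# Hypothetical prompt table: one vector per source type.
# In a trained model these would be learned; here they are random.
PROMPTS = {name: rng.standard_normal(D)
           for name in ["speech", "sfx", "music-mix",
                        "drums", "bass", "vocals", "other"]}


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_prompt(features, prompts):
    """Jointly model prompts and features (assumed stand-in).

    Prompts are concatenated with the encoded features and passed through
    one self-attention layer without learned projections, so every prompt
    attends to the mixture and to the other prompts.
    """
    n = prompts.shape[0]
    joint = np.concatenate([prompts, features], axis=0)   # (N + T, D)
    attn = softmax(joint @ joint.T / np.sqrt(joint.shape[1]))
    joint = joint + attn @ joint                          # residual update
    return joint[n:], joint[:n]                           # features, prompts


def separate(features, prompt_names):
    """Return one feature stream per prompt, shape (N, T, D)."""
    prompts = np.stack([PROMPTS[p] for p in prompt_names])  # (N, D)
    feats, prm = cross_prompt(features, prompts)
    # Assumed conditioning: element-wise product of each prompt with the
    # shared features, broadcast to (N, T, D).
    return prm[:, None, :] * feats[None, :, :]


features = rng.standard_normal((50, D))  # stand-in for encoded mixture, T = 50
cass_out = separate(features, ["speech", "music-mix", "sfx"])
mss_out = separate(features, ["drums", "bass", "vocals", "other"])
print(cass_out.shape, mss_out.shape)
```

In this sketch the same mixture features give three streams for a CASS-style prompt set and four for an MSS-style one, which mirrors how the choice of prompts, rather than a change of model, selects the task.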