A Two-Stage Band-Split Mamba-2 Network For Music Separation

Jinglin Bai, Yuan Fang, Jiajie Wang, Xueliang Zhang
2024-09-14
Abstract: Music source separation (MSS) aims to separate mixed music into its distinct tracks, such as vocals, bass, drums, and more. MSS is considered a challenging audio separation task due to the complexity of music signals. Although RNN and Transformer architectures are not perfect, they are commonly used to model the music sequence for MSS. Recently, Mamba-2 has demonstrated high efficiency in various sequence modeling tasks, but its superiority has not been investigated in MSS. This paper applies Mamba-2 with a two-stage strategy, which introduces residual mapping on top of the mask method, effectively compensating for the details absent in the mask and further improving separation performance. Experiments confirm the superiority of bidirectional Mamba-2 and the effectiveness of the two-stage network in MSS. The source code is publicly accessible at https://github.com/baijinglin/TS-BSmamba2.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses Music Source Separation (MSS): separating a mixed music signal into individual tracks such as vocals, bass, and drums. Specifically, it proposes a Two-Stage Band-Split Network based on the Mamba-2 architecture (TS-BSMAMBA2) to improve separation quality. MSS is challenging because music signals are highly complex. Although Recurrent Neural Networks (RNNs) and Transformer architectures are widely used in MSS, they have limitations, such as difficulty in parallelization, vanishing or exploding gradients, and high computational cost. Mamba-2, a newer architecture, addresses these limitations through its Structured State Space Duality (SSD) framework.
The main contributions of the paper are:
1. Applying Mamba-2 to the MSS task for the first time.
2. Proposing a two-stage approach: the first stage estimates complex-valued masks for the different tracks to learn coarse features, and the second stage predicts a residual mapping to capture fine-grained features.
3. Demonstrating experimentally that the proposed method is effective, especially for lightweight models, which maintain good performance while reducing computational cost.
In an experimental evaluation on the MUSDB18-HQ dataset, TS-BSMAMBA2 shows strong performance on multiple tracks, with fewer parameters and lower computational complexity than some existing baseline models.
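The two-stage idea (a complex-valued mask for the coarse estimate, then a residual for the fine detail) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `apply_complex_mask`, `add_residual`, and `two_stage_estimate` are hypothetical, the mask and residual are hard-coded rather than predicted by a network, and combining the two stages by simple addition is an assumption made for illustration.

```python
# Toy illustration of "mask, then residual" on a tiny complex spectrogram.
# Assumptions (not from the paper): names, additive combination of stages,
# and hand-picked mask/residual values standing in for network outputs.

def apply_complex_mask(mixture, mask):
    """Stage 1 (coarse): element-wise complex multiplication of mask and mixture."""
    return [[m * x for m, x in zip(mask_row, mix_row)]
            for mask_row, mix_row in zip(mask, mixture)]


def add_residual(coarse, residual):
    """Stage 2 (fine): add a residual term to the masked estimate."""
    return [[c + r for c, r in zip(c_row, r_row)]
            for c_row, r_row in zip(coarse, residual)]


def two_stage_estimate(mixture, mask, residual):
    """Combine both stages: estimate = mask * mixture + residual."""
    coarse = apply_complex_mask(mixture, mask)
    return add_residual(coarse, residual)


# 2 frequency bins x 2 time frames of a mixture spectrogram (complex values).
mixture = [[1 + 1j, 2 + 0j],
           [0 + 2j, 1 - 1j]]
# Stand-in for a stage-1 output: a complex mask scales and rotates each bin.
mask = [[0.5 + 0j, 0 + 1j],
        [1 + 0j, 0.5 + 0.5j]]
# Stand-in for a stage-2 output: small corrections the mask cannot express.
residual = [[0.1 + 0j, 0 + 0j],
            [0 - 0.1j, 0 + 0j]]

estimate = two_stage_estimate(mixture, mask, residual)
```

A multiplicative mask can only rescale and rotate energy that is already present in a mixture bin; an additive residual term can also adjust bins where the mask alone gives a poor estimate, which is one way to read the abstract's remark about compensating for details absent in the mask.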