A Two-Stage Band-Split Mamba-2 Network For Music Separation

Jinglin Bai, Yuan Fang, Jiajie Wang, Xueliang Zhang

2024-09-14

Abstract: Music source separation (MSS) aims to separate mixed music into its distinct tracks, such as vocals, bass, drums, and more. MSS is considered a challenging audio separation task due to the complexity of music signals. Although the RNN and Transformer architectures are not perfect, they are commonly used to model the music sequence for MSS. Recently, Mamba-2 has demonstrated high efficiency in various sequential modeling tasks, but its superiority has not been investigated in MSS. This paper applies Mamba-2 with a two-stage strategy, which introduces residual mapping based on the mask method, effectively compensating for the details absent in the mask and further improving separation performance. Experiments confirm the superiority of bidirectional Mamba-2 and the effectiveness of the two-stage network in MSS. The source code is publicly accessible at https://github.com/baijinglin/TS-BSmamba2.

Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses Music Source Separation (MSS): separating a mixed music signal into individual tracks such as vocals, bass, and drums. It proposes a Two-Stage Band-Split Network based on the Mamba-2 architecture (TS-BSMAMBA2) to improve separation quality.

MSS is challenging because music signals are highly complex. Recurrent Neural Networks (RNN) and Transformer architectures are widely used in MSS, but they have limitations such as difficulty in parallelization, vanishing or exploding gradients, and high computational cost. Mamba-2, a newer architecture, addresses these limitations through its Structured State Space Duality (SSD) framework.

The main contributions of the paper include:

1. Applying Mamba-2 to the MSS task for the first time.
2. Proposing a two-stage approach: the first stage estimates complex-valued masks for the different tracks to learn coarse features, and the second stage predicts a residual mapping to capture fine-grained features.
3. Showing experimentally that the proposed method is effective, especially as a lightweight model that maintains good performance while reducing computational cost.

In experiments on the MUSDB18-HQ dataset, TS-BSMAMBA2 performs strongly on multiple tracks, with fewer parameters and lower computational complexity than some existing baseline models.
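The two-stage idea (a complex-valued mask followed by an additive residual) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the tensor shapes, and the randomly generated "network outputs" are all hypothetical, and the residual here is an oracle value used only to show how an additive term can restore what the mask alone misses.

```python
import numpy as np

def apply_complex_mask(mixture_spec, mask):
    """Stage 1 (sketch): element-wise complex product of mixture and mask."""
    return mixture_spec * mask

def add_residual(coarse_spec, residual):
    """Stage 2 (sketch): add a predicted residual to the masked estimate."""
    return coarse_spec + residual

rng = np.random.default_rng(0)
n_freq, n_frames = 5, 4  # toy spectrogram size (assumption)

def random_complex(shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Toy complex spectrograms for one target track and everything else.
target = random_complex((n_freq, n_frames))
interference = random_complex((n_freq, n_frames))
mixture = target + interference

# Hypothetical stage-1 output: the ideal mask plus a small error,
# standing in for an imperfect network prediction.
ideal_mask = target / mixture
mask = ideal_mask + 0.1 * random_complex((n_freq, n_frames))
coarse = apply_complex_mask(mixture, mask)

# Hypothetical stage-2 output: an oracle residual, used only to show
# that the additive term can make up for the mask's error.
residual = target - coarse
refined = add_residual(coarse, residual)

coarse_error = np.abs(coarse - target).max()
refined_error = np.abs(refined - target).max()
```

In this toy setting `coarse_error` is nonzero while `refined_error` is numerically zero, because the residual is computed from the ground truth. In a trained model both the mask and the residual would be network predictions, so the second stage would reduce the error rather than remove it.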

Music Source Separation With Band-Split RNN

Music Source Separation with Band-Split RoPE Transformer

Stereophonic Music Source Separation with Spatially-Informed Bridging Band-Split Network

Music Source Separation Based on a Lightweight Deep Learning Framework (DTTNET: DUAL-PATH TFC-TDF UNET)

The Whole Is Greater than the Sum of Its Parts: Improving Music Source Separation by Bridging Network

Attention-based Neural Network for End-to-end Music Separation

The whole is greater than the sum of its parts: improving music source separation by bridging networks

An Ensemble Approach to Music Source Separation: A Comparative Analysis of Conventional and Hierarchical Stem Separation

Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation

Multi-stage music separation network with dual-branch attention and hybrid convolution

Deep Representation-Decoupling Neural Networks for Monaural Music Mixture Separation

SPMamba: State-space model is all you need in speech separation

SepMamba: State-space models for speaker separation using Mamba

Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation

Music Source Separation in the Waveform Domain

Deep Neural Network Based Audio Source Separation

Hierarchic Temporal Convolutional Network With Cross-Domain Encoder for Music Source Separation

SCNet: Sparse Compression Network for Music Source Separation