A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation

Karn N. Watcharasupat, Chih-Wei Wu, Yiwei Ding, Iroro Orife, Aaron J. Hipple, Phillip A. Williams, Scott Kramer, Alexander Lerch, William Wolcott

DOI: https://doi.org/10.1109/OJSP.2023.3339428

2023-12-02

Abstract: Cinematic audio source separation is a relatively new subtask of audio source separation, with the aim of extracting the dialogue, music, and effects stems from their mixture. In this work, we developed a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis. Psychoacoustically motivated frequency scales were used to inform the band definitions which are now defined with redundancy for more reliable feature extraction. A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm was proposed. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility during inference time with detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset with performance above the ideal ratio mask for the dialogue stem.
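An overcomplete partition of the frequency axis, as mentioned in the abstract, can be illustrated with a rough sketch. The function name `overlapping_bands`, the choice of the mel scale, and the 50% overlap pattern are assumptions made here for illustration only; the band definitions actually used in the paper may differ.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale formula; one of several psychoacoustic scales.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def overlapping_bands(n_bins, sample_rate, n_bands):
    """Return a boolean (n_bands, n_bins) band-membership matrix.

    Hypothetical helper: band b spans edges[b]..edges[b + 2] on a mel-spaced
    grid, so neighbouring bands overlap and most bins belong to two bands.
    """
    nyquist = sample_rate / 2.0
    freqs = np.linspace(0.0, nyquist, n_bins)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(nyquist), n_bands + 2))
    # Pin the end points so rounding cannot leave the first or last bin uncovered.
    edges[0], edges[-1] = 0.0, nyquist
    member = np.zeros((n_bands, n_bins), dtype=bool)
    for b in range(n_bands):
        member[b] = (freqs >= edges[b]) & (freqs <= edges[b + 2])
    return member

bands = overlapping_bands(n_bins=1025, sample_rate=44100, n_bands=32)
```

A partition is complete when every frequency bin belongs to at least one band, and overcomplete when the total number of memberships exceeds the number of bins, which is the case here.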

Audio and Speech Processing, Machine Learning, Sound, Signal Processing

What problem does this paper attempt to address?

The paper addresses cinematic audio source separation (CASS): extracting the dialogue, music, and effects stems from a mixed soundtrack. The authors generalize the Bandsplit RNN into a model that accepts any complete or overcomplete partition of the frequency axis. The band definitions follow psychoacoustically motivated frequency scales and are defined with redundancy, which the authors use to make feature extraction more reliable. They also propose a loss function motivated by the signal-to-noise ratio (SNR) and the sparsity-promoting property of the 1-norm. A common encoder shared across the stems reduces computational complexity during both training and inference, improves separation of hard-to-generalize sound classes, and allows decoders to be detached at inference time. The best model sets the state of the art on the Divide and Remaster dataset and exceeds the ideal ratio mask (IRM) on the dialogue stem.
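To make the idea of an SNR-motivated loss built on the 1-norm concrete, here is a minimal sketch. The function name `l1_snr_db`, the stabilizer `eps`, and the decision to apply it to a single array are assumptions for illustration; the exact formulation and the signal domains it is applied to are defined in the paper, not here.

```python
import numpy as np

def l1_snr_db(estimate, target, eps=1e-3):
    """Ratio of 1-norm error to 1-norm target, in dB (lower is better).

    Hypothetical sketch: mirrors a negative SNR, but replaces the squared
    2-norm with the 1-norm. eps keeps the ratio finite for silent targets.
    """
    err = np.sum(np.abs(estimate - target))
    ref = np.sum(np.abs(target))
    return 10.0 * np.log10((err + eps) / (ref + eps))

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
noisy = target + 0.1 * rng.standard_normal(16000)

loss_good = l1_snr_db(noisy, target)                      # small error
loss_silent = l1_snr_db(np.zeros_like(target), target)    # all-zero estimate
```

With this definition an all-zero estimate scores 0 dB, and the value becomes more negative as the estimate approaches the target.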

Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation

Music Source Separation With Band-Split RNN

Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks

End-to-end Networks for Supervised Single-channel Speech Separation

SCNet: Sparse Compression Network for Music Source Separation

GASS: Generalizing Audio Source Separation with Large-scale Data

Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support

Improving Universal Sound Separation Using Sound Classification

End-to-end Non-Negative Autoencoders for Sound Source Separation

Sound field decomposition based on two-stage neural networks

SpaIn-Net: Spatially-Informed Stereophonic Music Source Separation

Binaural Angular Separation Network

Music source separation conditioned on 3D point clouds

The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks

A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems

AudioSlots: A slot-centric generative model for audio separation

Sound Source Separation Using Latent Variational Block-Wise Disentanglement

The Whole Is Greater than the Sum of Its Parts: Improving Music Source Separation by Bridging Network

Source Separation of Multi-source Raw Music using a Residual Quantized Variational Autoencoder

Semantic Grouping Network for Audio Source Separation