A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation

Karn N. Watcharasupat, Chih-Wei Wu, Yiwei Ding, Iroro Orife, Aaron J. Hipple, Phillip A. Williams, Scott Kramer, Alexander Lerch, William Wolcott
DOI: https://doi.org/10.1109/OJSP.2023.3339428
2023-12-02
Abstract: Cinematic audio source separation is a relatively new subtask of audio source separation, with the aim of extracting the dialogue, music, and effects stems from their mixture. In this work, we developed a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis. Psychoacoustically motivated frequency scales were used to inform the band definitions, which are now defined with redundancy for more reliable feature extraction. A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm was proposed. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility during inference time with detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset with performance above the ideal ratio mask for the dialogue stem.
Audio and Speech Processing, Machine Learning, Sound, Signal Processing
What problem does this paper attempt to address?
The paper addresses cinematic audio source separation (CASS): extracting the dialogue, music, and effects stems from a mixed soundtrack. The authors generalize the Bandsplit RNN into a Generalized Bandsplit Neural Network that accepts any complete or overcomplete partition of the frequency axis. The band definitions follow psychoacoustically motivated frequency scales and are defined with redundancy, which is intended to make feature extraction more reliable. The authors also propose a loss function motivated by the signal-to-noise ratio (SNR) and the sparsity-promoting property of the 1-norm. A common encoder shared by multiple decoders reduces computational complexity during both training and inference, improves separation performance for hard-to-generalize classes of sounds, and lets decoders be detached at inference time. The best model sets the state of the art on the Divide and Remaster dataset, with dialogue-stem performance above that of the ideal ratio mask (IRM).
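To make the idea of a loss "motivated by the signal-to-noise ratio and the 1-norm" concrete, the following is a minimal sketch of one plausible form: the usual SNR ratio with the squared 2-norms replaced by 1-norms, expressed in decibels so that lower is better. The function name `l1_snr_loss`, the stabilizer `eps`, and the exact formula are assumptions made for illustration; the abstract does not give the definition, so this should not be read as the authors' loss.

```python
import math


def l1_snr_loss(estimate, target, eps=1e-3):
    """Hypothetical 1-norm analogue of a negative-SNR loss, in dB.

    Assumed form: 10 * log10((||estimate - target||_1 + eps) / (||target||_1 + eps)).
    Lower values mean the estimate is closer to the target.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    # 1-norm of the estimation error and of the reference signal.
    error_l1 = sum(abs(e - t) for e, t in zip(estimate, target))
    target_l1 = sum(abs(t) for t in target)
    # eps keeps the ratio finite for silent targets or perfect estimates.
    return 10.0 * math.log10((error_l1 + eps) / (target_l1 + eps))


# Toy signals: a sinusoid as the reference stem, and two estimates of it.
target = [math.sin(0.1 * n) for n in range(1000)]
close = [0.9 * t for t in target]   # 10% amplitude error
silent = [0.0] * len(target)        # outputs nothing at all

loss_close = l1_snr_loss(close, target)
loss_silent = l1_snr_loss(silent, target)
```

Under this assumed form, an all-zero estimate scores 0 dB because its error equals the target itself, while the estimate with a 10% amplitude error scores roughly -10 dB. In a training setup the same expression would be written with a differentiable tensor library instead of plain Python lists.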