Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation

Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife

2024-08-26

Abstract: Cinematic audio source separation (CASS), as a standalone problem of extracting individual stems from their mixture, is a fairly new subtask of audio source separation. A typical setup of CASS is a three-stem problem, with the aim of separating the mixture into the dialogue (DX), music (MX), and effects (FX) stems. Given the creative nature of cinematic sound production, however, several edge cases exist; some sound sources do not fit neatly in any of these three stems, necessitating the use of additional auxiliary stems in production. One very common edge case is the singing voice in film audio, which may belong in either the DX or MX or neither, depending heavily on the cinematic context. In this work, we demonstrate a very straightforward extension of the dedicated-decoder Bandit and query-based single-decoder Banquet models to a four-stem problem, treating non-musical dialogue, instrumental music, singing voice, and effects as separate stems. Interestingly, the query-based Banquet model outperformed the dedicated-decoder Bandit model. We hypothesized that this is due to a better feature alignment at the bottleneck as enforced by the band-agnostic FiLM layer. Dataset and model implementation will be made available at https://github.com/kwatcharasupat/source-separation-landing.

Audio and Speech Processing, Artificial Intelligence, Machine Learning, Sound

What problem does this paper attempt to address?

This paper addresses singing voice separation in cinematic audio source separation (CASS). Specifically, the authors treat non-musical dialogue, instrumental music, singing voice, and effects as separate stems in a four-stem separation task. CASS is typically set up as a three-stem problem (dialogue, music, effects), but this setup falls short in certain edge cases, particularly the singing voice, which may belong in the dialogue stem, the music stem, or neither, depending on the cinematic context. The paper therefore extends two existing models, the dedicated-decoder Bandit and the query-based single-decoder Banquet, to a four-stem problem with a separate stem for the singing voice. Experimental results show that the query-based single-decoder Banquet model outperforms the dedicated-decoder Bandit model in separating the singing voice and the other stems. The authors hypothesize that this is due to better feature alignment at the bottleneck, as enforced by the band-agnostic FiLM layer. Overall, the study aims to make CASS systems more flexible and accurate by introducing a dedicated singing voice stem.
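
To illustrate the query-based conditioning mentioned above, the following is a minimal NumPy sketch of a FiLM (feature-wise linear modulation) layer applied identically to every frequency band. It is not the authors' implementation: the tensor shapes, the random weights, the stem query vectors, and the function name `film` are all assumptions made for illustration only.

```python
import numpy as np


def film(features, query, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma(query) * features + beta(query).

    gamma and beta depend only on the query, so the same scale and shift
    are broadcast over every band and frame (band-agnostic by construction).
    """
    gamma = query @ w_gamma + b_gamma  # shape: (n_channels,)
    beta = query @ w_beta + b_beta     # shape: (n_channels,)
    return gamma * features + beta


rng = np.random.default_rng(0)

# Hypothetical bottleneck dimensions, chosen only for this example.
n_bands, n_frames, n_channels, query_dim = 4, 10, 8, 16

# Shared bottleneck features: (bands, frames, channels).
features = rng.standard_normal((n_bands, n_frames, n_channels))

# One hypothetical query embedding per stem of the four-stem problem.
stem_queries = {
    name: rng.standard_normal(query_dim)
    for name in ("dialogue", "music", "singing", "effects")
}

# Hypothetical projection weights; random here, learned in a real model.
w_gamma = rng.standard_normal((query_dim, n_channels))
b_gamma = np.ones(n_channels)
w_beta = rng.standard_normal((query_dim, n_channels))
b_beta = np.zeros(n_channels)

# A single set of FiLM weights serves all stems; only the query changes.
conditioned = {
    name: film(features, q, w_gamma, b_gamma, w_beta, b_beta)
    for name, q in stem_queries.items()
}
```

In this sketch a single decoder would consume `conditioned[name]` for whichever stem is queried, whereas a dedicated-decoder design would keep one decoder per stem and need no query.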

Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support

Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks

A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems

The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks

A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation

Unsupervised Single-Channel Singing Voice Separation with Weighted Robust Principal Component Analysis Based on Gammatone Auditory Filterbank and Vocal Activity Detection

Audiovisual Singing Voice Separation

GASS: Generalizing Audio Source Separation with Large-scale Data

Zero-Shot Duet Singing Voices Separation with Diffusion Models

An Ensemble Approach to Music Source Separation: A Comparative Analysis of Conventional and Hierarchical Stem Separation

Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation

Resource-constrained stereo singing voice cancellation

A Deep-Learning Based Framework for Source Separation, Analysis, and Synthesis of Choral Ensembles

Task-Aware Unified Source Separation

Separate Anything You Describe

Music Source Separation in the Waveform Domain

Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding

Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation

Visually Guided Sound Source Separation Using Cascaded Opponent Filter Network

DJCM: A Deep Joint Cascade Model for Singing Voice Separation and Vocal Pitch Estimation