Abstract: Cinematic audio source separation (CASS), the problem of extracting the dialogue, music, and effects stems from their mixture, is a relatively new subtask of audio source separation. To date, only one publicly available dataset exists for CASS: the Divide and Remaster (DnR) dataset, which is currently at version 2. While DnR v2 has been an incredibly useful resource for CASS, several areas of improvement have been identified, particularly through its use in the 2023 Sound Demixing Challenge. In this work, we develop version 3 of the DnR dataset, addressing issues relating to vocal content in non-dialogue stems, loudness distributions, mastering process, and linguistic diversity. In particular, the dialogue stem of DnR v3 includes speech content from more than 30 languages from multiple families including but not limited to the Germanic, Romance, Indo-Aryan, Dravidian, Malayo-Polynesian, and Bantu families. Benchmark results using the Bandit model indicate that training on multilingual data allows the model to generalize well, even to languages with low data availability. Even in languages with high data availability, the multilingual model often performs on par with or better than dedicated models trained on monolingual CASS datasets. Dataset and model implementation will be made available at https://github.com/kwatcharasupat/source-separation-landing.

What problem does this paper attempt to address?

This paper aims to improve the existing Cinematic Audio Source Separation (CASS) dataset, Divide and Remaster (DnR) v2, by addressing its known deficiencies. Specifically, the authors target the following issues:

1. **Vocal content in non-dialogue tracks**: In DnR v2, the non-dialogue tracks (music and effects) contain vocal content, which hampers model training.
2. **Mismatch in loudness distribution**: The loudness distribution of the original dataset is inconsistent with that of actual movie audio, resulting in poor model performance in real-world scenarios.
3. **Mastering process**: The original mastering process is not close enough to industry standards, which affects the generalization ability of the model.
4. **Lack of language diversity**: The original dataset mainly contains English dialogue and lacks multilingual support, which limits the use of the model in other language environments.

To solve these problems, the authors developed the DnR v3 dataset, which improves the quality and applicability of the data in the following ways:

- **Multilingual support**: DnR v3 contains dialogue data from more than 30 languages, covering multiple language families, including but not limited to the Germanic, Romance, Indo-Aryan, Dravidian, Malayo-Polynesian, and Bantu families.
- **Removal of vocal content in non-dialogue tracks**: Music and effects tracks no longer contain speech or other vocal content.
- **Adjustment of loudness and timing parameters**: The loudness and timing distributions of the dataset are brought closer to those of real movie audio.
- **Improvement of the mastering process**: The relative loudness between tracks is retained, and the mastering process simulates industry-standard practice.

Through these improvements, the authors hope that DnR v3 better reflects the diversity and complexity of movie audio production, thereby improving model performance in practical applications. The authors also ran benchmark tests using the Bandit model. The results show that the multilingual model generalizes well across languages, including those with little available data, and that it often performs on par with or better than monolingual models even in languages with plentiful data.

### Summary

The main objective of this paper is to address the deficiencies of the existing CASS dataset by developing DnR v3. It does so by increasing language diversity, removing vocal content from non-dialogue tracks, adjusting the loudness distribution, and improving the mastering process, in order to enhance model performance and generalization on real movie audio.
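The idea of retaining the relative loudness between tracks during mastering can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `rms_db` and `master_mixture`, the use of RMS level as a stand-in for a proper loudness measure such as LUFS, the target level of -27 dB, and the synthetic noise stems are all assumptions made for illustration. The sketch only shows that applying one shared gain to every stem changes the mixture level while leaving the level differences between stems untouched.

```python
import numpy as np


def rms_db(x, eps=1e-12):
    """RMS level in dB relative to full scale (a crude proxy, not LUFS)."""
    return 10.0 * np.log10(np.mean(np.square(x)) + eps)


def master_mixture(stems, target_db=-27.0):
    """Sum the stems and apply one shared gain so the mixture hits target_db.

    target_db is an arbitrary illustrative value. Because the same gain
    multiplies every stem, level differences between stems are unchanged.
    """
    mixture = sum(stems.values())
    gain = 10.0 ** ((target_db - rms_db(mixture)) / 20.0)
    scaled_stems = {name: gain * x for name, x in stems.items()}
    return gain * mixture, scaled_stems


# Synthetic one-second "stems" at 48 kHz; real stems would be audio clips.
rng = np.random.default_rng(0)
stems = {
    "dialogue": 0.10 * rng.standard_normal(48000),
    "music": 0.05 * rng.standard_normal(48000),
    "effects": 0.02 * rng.standard_normal(48000),
}
mixture, scaled = master_mixture(stems)
```

After this step, the scaled stems still sum exactly to the scaled mixture, so they remain valid separation targets for the mastered mixture.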

Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support

Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation

The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks

The Sound Demixing Challenge 2023 – Cinematic Demixing Track

Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks

A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation

Benchmarks and leaderboards for sound demixing tasks

PodcastMix: A dataset for separating music and speech in podcasts

MD3: The Multi-Dialect Dataset of Dialogues

LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization

Audio Dialogues: Dialogues dataset for audio and music understanding

GASS: Generalizing Audio Source Separation with Large-scale Data

Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and Evaluation

SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation

CL-MASR: A Continual Learning Benchmark for Multilingual ASR

DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI

CrowdSpeech and VoxDIY: Benchmark Datasets for Crowdsourced Audio Transcription

CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition

Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience

ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization

Improving Choral Music Separation through Expressive Synthesized Data from Sampled Instruments