Abstract: This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. In particular, we describe CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most successful approaches employed by participants. Compared to the cocktail-fork baseline, the best-performing system trained exclusively on the simulated Divide and Remaster (DnR) dataset achieved an improvement of 1.8 dB in SDR, whereas the top-performing system on the open leaderboard, where any data could be used for training, achieved an improvement of 5.7 dB. A significant source of this improvement was making the simulated data better match real cinematic audio, which we investigate in further detail.

What problem does this paper attempt to address?

The paper primarily addresses the problem of cinematic source separation, specifically aiming to separate dialogue, music, and sound effects from movie audio. By organizing the Cinematic Demixing Track (CDX) in the Sound Demixing Challenge 2023 (SDX'23), the paper aims to advance research in this field. To achieve this goal, the authors undertook the following tasks:

1. **Designing the challenge structure**: The CDX track required participants to submit systems capable of separating dialogue, sound effects, and music from stereo movie audio. The challenge was divided into two leaderboards: one that only allowed the use of the synthetic dataset Divide and Remaster (DnR) for training models, and another that permitted the use of any data for training.
2. **Building the dataset**: To evaluate the submitted systems, the authors constructed a new hidden test dataset, CDXDB23, composed of real movie audio. This dataset was carefully selected to ensure a balanced distribution of dialogue, sound effects, and music.
3. **Establishing baselines**: A baseline model based on Multi-Resolution CrossNet (MRX) was provided. This model was pre-trained on the DnR dataset and applied scaling to the input mixed audio to enhance performance.
4. **Evaluation metrics**: The global Signal-to-Distortion Ratio (SDR) was used as the evaluation metric to measure the quality of the separation results.

Key findings of the paper include:

- When trained only on the DnR dataset, the best system showed a 1.8 dB improvement over the baseline model.
- In the open leaderboard (allowing training with any data), the top-performing system achieved a significant 5.7 dB improvement in SDR over the baseline model.
- One of the successful strategies of high-performance systems was improving the match between synthetic data and real movie audio, particularly in emotional speech processing.
- The dialogue source benefited the most from additional training data, likely because more speech and vocal material helped the model better handle emotionally rich dialogue.

In summary, the paper promotes the development of cinematic source separation technology through the organization of the challenge and demonstrates the potential of deep learning models trained on specific datasets to solve real-world problems.
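The global SDR metric named in the summary above can be illustrated with a short sketch. It assumes the common definition used in demixing challenges: the ratio of total reference energy to total error energy, computed once over the whole clip and all channels, then averaged over the three stems. The function names `global_sdr` and `mean_sdr`, the stem keys, and the `eps` stabilizer are illustrative assumptions, not taken from the challenge's evaluation code.

```python
import math


def global_sdr(reference, estimate, eps=1e-8):
    """Global SDR in dB for one stem (assumed definition, see lead-in).

    reference, estimate: sequences of channels, each a sequence of samples.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same number of channels")
    signal_energy = 0.0
    error_energy = 0.0
    for ref_ch, est_ch in zip(reference, estimate):
        if len(ref_ch) != len(est_ch):
            raise ValueError("channels must have the same number of samples")
        for r, e in zip(ref_ch, est_ch):
            signal_energy += r * r            # energy of the true stem
            error_energy += (r - e) ** 2      # energy of the separation error
    # eps (an assumption) avoids division by zero and log of zero for silent stems
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


def mean_sdr(references, estimates, stems=("dialogue", "effects", "music")):
    """Average the per-stem global SDR over the three stems (hypothetical keys)."""
    return sum(global_sdr(references[s], estimates[s]) for s in stems) / len(stems)
```

Under this definition, an estimate that is a uniformly attenuated copy of the reference (for example, scaled by 0.9) has an error energy equal to 1% of the signal energy, which corresponds to about 20 dB.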

The Sound Demixing Challenge 2023 – Cinematic Demixing Track

The Sound Demixing Challenge 2023 – Music Demixing Track

Sound Demixing Challenge 2023 Music Demixing Track Technical Report: TFC-TDF-UNet v3

Benchmarks and leaderboards for sound demixing tasks

The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks

The ICASSP SP Cadenza Challenge: Music Demixing/Remixing for Hearing Aids

Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support

Music demixing with the sliCQ transform

Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks

Overview of the L3DAS23 Challenge on Audio-Visual Extended Reality

Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation

SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation

Summary of the DISPLACE Challenge 2023 - DIarization of SPeaker and LAnguage in Conversational Environments

The first Cadenza challenges: using machine learning competitions to improve music for listeners with a hearing loss

Leveraging Reverberation and Visual Depth Cues for Sound Event Localization and Detection with Distance Estimation

The DKU-MSXF Diarization System for the VoxCeleb Speaker Recognition Challenge 2023

The DKU-MSXF Speaker Verification System for the VoxCeleb Speaker Recognition Challenge 2023

A Benchmark of State-of-the-Art Sound Event Detection Systems Evaluated on Synthetic Soundscapes

The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Speech Quality and Testing Framework

FMSG-JLESS Submission for DCASE 2024 Task4 on Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels