The Sound Demixing Challenge 2023 – Cinematic Demixing Track

Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji
2024-04-18
Abstract: This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. In particular, we describe CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most successful approaches employed by participants. Compared to the cocktail-fork baseline, the best-performing system trained exclusively on the simulated Divide and Remaster (DnR) dataset achieved an improvement of 1.8 dB in SDR, whereas the top-performing system on the open leaderboard, where any data could be used for training, achieved an improvement of 5.7 dB. A significant source of this improvement was making the simulated data better match real cinematic audio, which we investigate in detail.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
The paper primarily addresses the problem of cinematic source separation, specifically aiming to separate dialogue, music, and sound effects from movie audio. By organizing the Cinematic Demixing Track (CDX) in the Sound Demixing Challenge 2023 (SDX'23), the paper aims to advance research in this field. To achieve this goal, the authors undertook the following tasks:

1. **Designing the challenge structure**: The CDX track required participants to submit systems capable of separating dialogue, sound effects, and music from stereo movie audio. The challenge was divided into two leaderboards: one that only allowed the use of the synthetic dataset Divide and Remaster (DnR) for training models, and another that permitted the use of any data for training.
2. **Building the dataset**: To evaluate the submitted systems, the authors constructed a new hidden test dataset, CDXDB23, composed of real movie audio. This dataset was carefully selected to ensure a balanced distribution of dialogue, sound effects, and music.
3. **Establishing baselines**: A baseline model based on Multi-Resolution CrossNet (MRX) was provided. This model was pre-trained on the DnR dataset and applied scaling to the input mixed audio to enhance performance.
4. **Evaluation metrics**: The global Signal-to-Distortion Ratio (SDR) was used as the evaluation metric to measure the quality of the separation results.

Key findings of the paper include:

- When trained only on the DnR dataset, the best system showed a 1.8 dB improvement over the baseline model.
- In the open leaderboard (allowing training with any data), the top-performing system achieved a significant 5.7 dB improvement in SDR over the baseline model.
- One of the successful strategies of high-performance systems was improving the match between synthetic data and real movie audio, particularly in emotional speech processing.
- The dialogue source benefited the most from additional training data, likely because more speech and vocal material helped the model better handle emotionally rich dialogue.

In summary, the paper promotes the development of cinematic source separation technology through the organization of the challenge and demonstrates the potential of deep learning models trained on specific datasets to solve real-world problems.
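
The summary names the global SDR as the evaluation metric without spelling it out. As a rough illustration, the sketch below computes a global SDR for one separated source: the ratio of the reference energy to the error energy, both summed over all channels and samples, expressed in dB. The function name `global_sdr`, the nested-list audio layout (channels of samples), and the small `eps` stabilizer are assumptions made for this example and are not taken from the paper or the challenge code.

```python
import math
from typing import Sequence


def global_sdr(
    reference: Sequence[Sequence[float]],
    estimate: Sequence[Sequence[float]],
    eps: float = 1e-12,  # assumed stabilizer to avoid division by zero
) -> float:
    """Global SDR in dB for one source, given audio as channels of samples."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate differ in channel count")

    signal_energy = 0.0
    error_energy = 0.0
    for ref_channel, est_channel in zip(reference, estimate):
        if len(ref_channel) != len(est_channel):
            raise ValueError("reference and estimate differ in length")
        for ref_sample, est_sample in zip(ref_channel, est_channel):
            # Accumulate energies over the whole signal (all channels, all samples).
            signal_energy += ref_sample * ref_sample
            error_energy += (ref_sample - est_sample) ** 2

    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


if __name__ == "__main__":
    # Toy stereo example: the estimate is the reference scaled by 0.9.
    dialogue_ref = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]
    dialogue_est = [[0.9 * s for s in ch] for ch in dialogue_ref]
    print(f"SDR: {global_sdr(dialogue_ref, dialogue_est):.2f} dB")
```

Because the energies are accumulated over the full signal before the ratio is taken, a single silent stretch does not dominate the score the way it can with frame-wise averaging. On this scale, the reported 1.8 dB and 5.7 dB gains correspond to reducing the error energy by factors of roughly 1.5 and 3.7 relative to the baseline for the same reference signal.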