Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation

Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife
2024-08-26
Abstract: Cinematic audio source separation (CASS), as a standalone problem of extracting individual stems from their mixture, is a fairly new subtask of audio source separation. A typical setup of CASS is a three-stem problem, with the aim of separating the mixture into the dialogue (DX), music (MX), and effects (FX) stems. Given the creative nature of cinematic sound production, however, several edge cases exist; some sound sources do not fit neatly in any of these three stems, necessitating the use of additional auxiliary stems in production. One very common edge case is the singing voice in film audio, which may belong in either the DX or MX or neither, depending heavily on the cinematic context. In this work, we demonstrate a very straightforward extension of the dedicated-decoder Bandit and query-based single-decoder Banquet models to a four-stem problem, treating non-musical dialogue, instrumental music, singing voice, and effects as separate stems. Interestingly, the query-based Banquet model outperformed the dedicated-decoder Bandit model. We hypothesized that this is due to a better feature alignment at the bottleneck as enforced by the band-agnostic FiLM layer. Dataset and model implementation will be made available at https://github.com/kwatcharasupat/source-separation-landing.
Audio and Speech Processing, Artificial Intelligence, Machine Learning, Sound
What problem does this paper attempt to address?
This paper addresses singing voice separation in cinematic audio source separation (CASS). Conventional CASS uses a three-stem setup (dialogue, music, effects), but this setup handles some edge cases poorly. The singing voice is a common one: depending on the cinematic context, it may belong in the dialogue stem, the music stem, or neither. The authors therefore recast the task as a four-stem problem that treats non-musical dialogue, instrumental music, singing voice, and effects as separate stems, and they extend two existing models, Bandit and Banquet, to this setup. In their experiments, the query-based single-decoder Banquet model outperforms the dedicated-decoder Bandit model. The authors hypothesize that this is because the band-agnostic FiLM layer enforces better feature alignment at the bottleneck. Overall, the study aims to make CASS systems more flexible by giving the singing voice its own stem.
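To make the FiLM idea concrete, the following is a minimal NumPy sketch of feature-wise linear modulation applied at a bottleneck, not the authors' implementation. It assumes the bottleneck features have shape (bands, frames, channels) and reads "band-agnostic" as meaning that the same per-channel scale and shift are applied to every band. The names `STEMS`, `make_film_params`, and `film` are hypothetical, and the random parameters stand in for what a trained model would derive from a learned stem query.

```python
import numpy as np

# Hypothetical stem labels for the four-stem setup described above.
STEMS = ["dialogue", "instrumental_music", "singing_voice", "effects"]


def make_film_params(n_stems, n_channels, seed=0):
    """Return per-stem scale (gamma) and shift (beta) vectors.

    Assumption: random values stand in for parameters that a trained
    model would compute from a learned query embedding.
    """
    rng = np.random.default_rng(seed)
    gamma = 1.0 + 0.1 * rng.standard_normal((n_stems, n_channels))
    beta = 0.1 * rng.standard_normal((n_stems, n_channels))
    return gamma, beta


def film(features, gamma, beta, stem_index):
    """Apply FiLM: scale and shift each channel for the queried stem.

    features has shape (n_bands, n_frames, n_channels). The same
    (gamma, beta) pair is broadcast over all bands and frames, which is
    how this sketch interprets "band-agnostic".
    """
    g = gamma[stem_index]  # shape (n_channels,)
    b = beta[stem_index]   # shape (n_channels,)
    return features * g + b


# Toy bottleneck features shared by all stem queries.
features = np.ones((8, 5, 4))
gamma, beta = make_film_params(len(STEMS), n_channels=4)

# One shared decoder would receive a differently modulated bottleneck per query.
singing = film(features, gamma, beta, STEMS.index("singing_voice"))
dialogue = film(features, gamma, beta, STEMS.index("dialogue"))
```

In this sketch a single set of bottleneck features is reused for every stem, and only the modulation changes with the query. That is the structural difference from a dedicated-decoder design, where each stem has its own decoder.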