Simultaneous Music Separation and Generation Using Multi-Track Latent Diffusion Models

Tornike Karchkhadze, Mohammad Rasool Izadi, Shlomo Dubnov
2024-10-15
Abstract: Diffusion models have recently shown strong potential in both music generation and music source separation tasks. Although in early stages, a trend is emerging towards integrating these tasks into a single framework, as both involve generating musically aligned parts and can be seen as facets of the same generative process. In this work, we introduce a latent diffusion-based multi-track generation model capable of both source separation and multi-track music synthesis by learning the joint probability distribution of tracks sharing a musical context. Our model also enables arrangement generation by creating any subset of tracks given the others. We trained our model on the Slakh2100 dataset, compared it with an existing simultaneous generation and separation model, and observed significant improvements across objective metrics for source separation, music, and arrangement generation tasks. Sound examples are available at https://msg-ld.github.io/.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the integration of music generation and music source separation into a single framework. The two tasks are:

1. **Music Generation**: Generating multi-track music, either unconditionally (producing new pieces without any input) or partially (given some tracks, generating the missing ones, as in music arrangement).
2. **Music Source Separation**: Separating individual instruments or sound elements from mixed audio.

Traditional approaches usually handle these two tasks separately. The paper proposes MSG-LD (Music Separation and Generation with Latent Diffusion), a method based on the Latent Diffusion Model (LDM) that performs both tasks within one model. By learning the joint probability distribution of multi-track music, MSG-LD can generate new pieces, separate individual tracks from mixed audio, and generate missing tracks given the others. According to the reported experiments, it significantly outperforms the Multi-Source Diffusion Model (MSDM), an existing model that handles both generation and separation, on objective metrics for source separation, overall music generation, and arrangement generation.
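
One way to see why a single model can cover all three tasks is to write each task as a query against the same joint distribution over tracks. The notation below is an assumption introduced here for illustration (tracks x_1, ..., x_S, mixture y, and an index set I of tracks to generate); it is not taken from the paper, and the paper's latent-space conditioning may differ in detail.

```latex
% Illustrative notation only (not from the paper):
%   x_1, ..., x_S : individual tracks sharing a musical context
%   y             : the mixed audio
%   I             : indices of the tracks to generate; \bar{I} : the given tracks
\begin{align*}
\text{Total generation:}       \quad & (x_1,\dots,x_S) \sim p(x_1,\dots,x_S) \\
\text{Source separation:}      \quad & (x_1,\dots,x_S) \sim p(x_1,\dots,x_S \mid y) \\
\text{Arrangement generation:} \quad & x_I \sim p(x_I \mid x_{\bar{I}}),
    \qquad \bar{I} = \{1,\dots,S\} \setminus I
\end{align*}
```

Read this way, separation and arrangement generation are conditional versions of total generation, which is consistent with the abstract's description of both as facets of the same generative process.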