Simultaneous Music Separation and Generation Using Multi-Track Latent Diffusion Models

Tornike Karchkhadze, Mohammad Rasool Izadi, Shlomo Dubnov

2024-10-15

Abstract: Diffusion models have recently shown strong potential in both music generation and music source separation tasks. Although in early stages, a trend is emerging towards integrating these tasks into a single framework, as both involve generating musically aligned parts and can be seen as facets of the same generative process. In this work, we introduce a latent diffusion-based multi-track generation model capable of both source separation and multi-track music synthesis by learning the joint probability distribution of tracks sharing a musical context. Our model also enables arrangement generation by creating any subset of tracks given the others. We trained our model on the Slakh2100 dataset, compared it with an existing simultaneous generation and separation model, and observed significant improvements across objective metrics for source separation, music, and arrangement generation tasks. Sound examples are available at https://msg-ld.github.io/.

Sound, Audio and Speech Processing

What problem does this paper attempt to address?

This paper aims to integrate music generation and music source separation into a unified framework. Specifically:

1. **Music Generation**: Generating multi-track music, including unconditional generation (producing new musical pieces without relying on any input) and partial generation (producing the missing tracks given some existing ones, similar to music arrangement).
2. **Music Source Separation**: Separating individual instruments or sound elements from mixed audio.

Traditional approaches usually handle these two tasks separately. This paper proposes a method based on the Latent Diffusion Model (LDM), called MSG-LD (Music Separation and Generation with Latent Diffusion), which accomplishes both tasks within a single model. By learning the joint probability distribution of the tracks in multi-track music, MSG-LD can generate new musical pieces, separate individual tracks from mixed audio, and generate missing tracks given the others. The paper reports that MSG-LD significantly outperforms existing models that handle both generation and separation simultaneously, such as the Multi-Source Diffusion Model (MSDM), on objective metrics for source separation, overall music generation, and arrangement generation.
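To make the idea of partial generation from a joint model more concrete, here is a minimal sketch of inpainting-style masked sampling over a stack of tracks. It is an illustration under stated assumptions, not the MSG-LD implementation: the summary above does not say how the model conditions on the given tracks, `toy_denoiser` is a hypothetical stand-in for a trained network, the noise schedule is arbitrary, and random arrays stand in for track latents. Source separation would additionally require conditioning on the mixture, which is not shown.

```python
import numpy as np


def toy_denoiser(x_t, sigma):
    """Hypothetical stand-in for a trained network (assumption, not from the paper).

    Returns a guess of the clean tracks; here, the posterior mean for data
    drawn from N(0, I), which simply shrinks the noisy input toward zero.
    """
    return x_t / (1.0 + sigma ** 2)


def sample_tracks(given, known_mask, sigmas, rng):
    """Inpainting-style sampling over a stacked (tracks, frames) array.

    given      : array of shape (S, T); rows where known_mask is False are ignored
    known_mask : boolean array of shape (S,), True for tracks supplied as context
    sigmas     : decreasing noise levels ending in 0.0
    """
    mask = known_mask[:, None]  # broadcast the per-track mask over frames
    x = sigmas[0] * rng.standard_normal(given.shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Overwrite the known tracks with a freshly noised copy at this level,
        # so the denoiser always sees the context at a matching noise level.
        noised_given = given + sigma * rng.standard_normal(given.shape)
        x = np.where(mask, noised_given, x)
        # Deterministic step toward the denoiser's clean estimate.
        x0_hat = toy_denoiser(x, sigma)
        x = x0_hat + (sigma_next / sigma) * (x - x0_hat)
    # Known tracks are returned exactly as given; the rest are generated.
    return np.where(mask, given, x)


rng = np.random.default_rng(0)
S, T = 4, 8  # four illustrative tracks, eight latent frames
given = rng.standard_normal((S, T))
sigmas = np.linspace(1.0, 0.0, 11)

# Unconditional generation: no track is given, all are sampled.
total = sample_tracks(given, np.zeros(S, dtype=bool), sigmas, rng)

# Arrangement generation: the first two tracks are given, the last two are sampled.
arrangement = sample_tracks(given, np.array([True, True, False, False]), sigmas, rng)
```

The same sampling routine serves both generation modes; only the mask changes. That is the practical appeal of modeling all tracks jointly rather than training one model per task.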

Multi-Track MusicLDM: Towards Versatile Music Generation with Latent Diffusion Model

Multi-Source Diffusion Models for Simultaneous Music Generation and Separation

Multi-Source Music Generation with Latent Diffusion

Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models

Multimodal Latent Language Modeling with Next-Token Diffusion

Long-form music generation with latent diffusion

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

Bass Accompaniment Generation via Latent Diffusion

Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models

Controllable Music Production with Diffusion Models and Guidance Gradients

Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion

Separate And Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation

GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework

Progressive distillation diffusion for raw music generation

Combining audio control and style transfer using latent diffusion

Music Separation Enhancement with Generative Modeling

A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis

DiffuseRoll: Multi-track multi-category music generation based on diffusion model

Unsupervised Composable Representations for Audio