Abstract: Most music generation models directly generate a single music mixture. To allow for more flexible and controllable generation, the Multi-Source Diffusion Model (MSDM) has been proposed to model music as a mixture of multiple instrumental sources (e.g., piano, drums, bass, and guitar). Its goal is to use a single diffusion model to generate mutually coherent music sources that are then mixed to form the music. Despite its capabilities, MSDM is unable to generate music with rich melodies and often generates empty sounds. Its waveform diffusion approach also introduces significant Gaussian noise artifacts that compromise audio quality. In response, we introduce the Multi-Source Latent Diffusion Model (MSLDM), which employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation. By training a VAE on all music sources, we efficiently capture each source's unique characteristics in a "source latent." The source latents are concatenated, and our diffusion model learns this joint latent space. This approach significantly enhances both total and partial generation of music by leveraging the VAE's latent compression and noise robustness. The compressed source latents also make generation more efficient. Subjective listening tests and Fréchet Audio Distance (FAD) scores confirm that our model outperforms MSDM, demonstrating its practical applicability in music generation systems. We also emphasize that modeling sources is more effective than direct music mixture modeling. Code and models are available at https://github.com/XZWY/MSLDM. Demos are available at https://xzwy.github.io/MSLDMDemo/.
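The data flow described in the abstract (encode each source with a shared encoder, concatenate the source latents, then decode and mix) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_encode` and `toy_decode` are hypothetical stand-ins for the VAE (plain average pooling and sample repetition), the latents are concatenated along a single flat axis, and no diffusion model is trained. It shows only the shapes involved, not the released implementation.

```python
import random

# Instrument names taken from the abstract's example.
SOURCES = ["piano", "drums", "bass", "guitar"]


def toy_encode(wave, factor=4):
    """Hypothetical stand-in for a VAE encoder: average-pool by `factor`."""
    return [sum(wave[i:i + factor]) / factor for i in range(0, len(wave), factor)]


def toy_decode(latent, factor=4):
    """Hypothetical stand-in for a VAE decoder: repeat each value `factor` times."""
    return [z for z in latent for _ in range(factor)]


def joint_latent(waves):
    """Encode every source with the same encoder and concatenate the latents."""
    joint = []
    for name in SOURCES:
        joint.extend(toy_encode(waves[name]))
    return joint


def split_and_mix(joint, latent_len):
    """Split the joint latent per source, decode each one, and sum into a mixture."""
    decoded = {}
    for k, name in enumerate(SOURCES):
        chunk = joint[k * latent_len:(k + 1) * latent_len]
        decoded[name] = toy_decode(chunk)
    mixture = [sum(samples) for samples in zip(*decoded.values())]
    return decoded, mixture


random.seed(0)
waves = {name: [random.uniform(-1, 1) for _ in range(16)] for name in SOURCES}

# A diffusion model would be trained on (and would sample) vectors like `z`.
z = joint_latent(waves)
sources, mix = split_and_mix(z, latent_len=4)
```

In this toy setup each 16-sample source compresses to a 4-value latent, so the joint latent has 16 values for four sources. Because the sources are decoded separately before mixing, individual tracks remain available, which is what makes partial generation (fixing some sources and generating the rest) possible in principle.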

Long-form music generation with latent diffusion

Fast Timing-Conditioned Latent Audio Diffusion

Multi-Track MusicLDM: Towards Versatile Music Generation with Latent Diffusion Model

Bass Accompaniment Generation via Latent Diffusion

Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion

Musical Form Generation

Noise2Music: Text-conditioned Music Generation with Diffusion Models

Multi-Source Music Generation with Latent Diffusion

Progressive distillation diffusion for raw music generation

Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models

Simultaneous Music Separation and Generation Using Multi-Track Latent Diffusion Models

Controllable Music Production with Diffusion Models and Guidance Gradients

Large Language Models: From Notes to Musical Form

Combining audio control and style transfer using latent diffusion

Symbolic Music Generation with Diffusion Models

Generation or Replication: Auscultating Audio Latent Diffusion Models

ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model

MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

Conditioning Deep Generative Raw Audio Models for Structured Automatic Music