Abstract: Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs for semantic, coarse acoustic, and fine acoustic modeling, respectively. Yet sampling with MusicLM requires passing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for real-time generation. Efficient music generation with quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the forward passes in MusicLM by 95.7% or 99.6% for sampling 10s or 30s of music, respectively. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages in sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.
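To put the reported reductions in perspective, the short sketch below converts a percentage reduction in forward passes into an equivalent "times fewer" factor. It assumes that a reduction of X% leaves (1 - X/100) of the MusicLM forward passes; the absolute pass counts are not given in the abstract, and the function name `speedup_from_reduction` is purely illustrative.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Convert a percentage reduction in forward passes into a 'times fewer' factor.

    Assumption (not stated in the abstract): a reduction of X% means that
    (1 - X/100) of the baseline forward passes remain.
    """
    remaining = 1.0 - reduction_pct / 100.0
    if not 0.0 < remaining <= 1.0:
        raise ValueError("reduction_pct must be in the range [0, 100)")
    return 1.0 / remaining


# Figures quoted in the abstract: 95.7% for 10s of music, 99.6% for 30s.
for seconds, pct in [(10, 95.7), (30, 99.6)]:
    factor = speedup_from_reduction(pct)
    print(f"{seconds}s music: {pct}% fewer passes, about {factor:.1f}x fewer")
```

Under this reading, the two quoted figures correspond to roughly 23 times and 250 times fewer forward passes, respectively. This counts forward passes only and says nothing about the cost of each pass or about wall-clock time.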

StemGen: A music generation model that listens

Music Generation System for Adversarial Training Based on Deep Learning

APE-GAN: A Novel Active Learning Based Music Generation Model with Pre-Embedding

Simple and Controllable Music Generation

A Survey of Music Generation in the Context of Interaction

Deep Learning-Based Music Generation

Conditioning Deep Generative Raw Audio Models for Structured Automatic Music

Music Generation based on Generative Adversarial Networks with Transformer

DeepJ: Style-Specific Music Generation

SDMuse: Stochastic Differential Music Editing and Generation via Hybrid Representation

Novel LSTM-GAN Based Music Generation

The Usage of Artificial Intelligence Technology in Music Education System Under Deep Learning

MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Deep learning for music generation: challenges and directions

Efficient Neural Music Generation

Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures

MusicGen-Chord: Advancing Music Generation through Chord Progressions and Interactive Web-UI

A Comprehensive Survey on Deep Music Generation: Multi-level Representations, Algorithms, Evaluations, and Future Directions

Deep generative models for musical audio synthesis

Evaluating Deep Music Generation Methods Using Data Augmentation

Generating music with sentiment using Transformer-GANs