Abstract: Existing music generation models are mostly language-based, neglecting the frequency continuity property of notes, resulting in inadequate fitting of rare or never-used notes and thus reducing the diversity of generated samples. We argue that the distribution of notes can be modeled by translational invariance and periodicity, especially using diffusion models to generalize notes by injecting frequency-domain Gaussian noise. However, due to the low-density nature of music symbols, estimating the distribution of notes latent in the high-density solution space poses significant challenges. To address this problem, we introduce the Music-Diff architecture, which fits a joint distribution of notes and accompanying semantic information to generate symbolic music conditionally. We first enhance the fragmentation module for extracting semantics by using event-based notations and the structural similarity index, thereby preventing boundary blurring. As a prerequisite for multivariate perturbation, we introduce a joint pre-training method to construct the progressions between notes and musical semantics while avoiding direct modeling of low-density notes. Finally, we recover the perturbed notes by a multi-branch denoiser that fits multiple noise objectives via Pareto optimization. Our experiments suggest that in contrast to language models, joint probability diffusion models perturbing at both note and semantic levels can provide more sample diversity and compositional regularity. The case study highlights the rhythmic advantages of our model over language- and DDPM-based models by analyzing the hierarchical structure expressed in the self-similarity metrics.
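To make the idea of perturbing notes with Gaussian noise concrete, here is a minimal sketch of the standard DDPM forward (noising) step applied to a short melody. It is an illustration under stated assumptions, not the Music-Diff implementation: the abstract does not specify how notes are encoded or what "frequency-domain" noise means exactly, so the sketch assumes MIDI pitch numbers (which are linear in log-frequency), a linear beta schedule, and hypothetical helper names (`midi_to_unit`, `unit_to_midi`, `q_sample`).

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice, common in DDPM setups)."""
    return np.linspace(beta_start, beta_end, num_steps)

def midi_to_unit(pitch, low=21, high=108):
    """Map MIDI pitches (piano range assumed) to the interval [-1, 1]."""
    return 2.0 * (np.asarray(pitch, dtype=float) - low) / (high - low) - 1.0

def unit_to_midi(x, low=21, high=108):
    """Map values back to the nearest MIDI pitch, clipped to the range."""
    pitch = (np.asarray(x) + 1.0) / 2.0 * (high - low) + low
    return np.clip(np.rint(pitch), low, high).astype(int)

def q_sample(x0, t, alpha_bar, rng):
    """Closed-form forward step: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise

betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal retention
rng = np.random.default_rng(0)

melody = np.array([60, 62, 64, 65, 67, 65, 64, 62])  # toy C-major fragment
x0 = midi_to_unit(melody)
xt, eps = q_sample(x0, t=50, alpha_bar=alpha_bar, rng=rng)
noisy_melody = unit_to_midi(xt)          # may contain pitches absent from `melody`
```

Because the noise acts on a continuous pitch axis, rounding `xt` back to MIDI numbers can produce pitches that never occurred in the input, which is the intuition behind covering rare or never-used notes.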

What problem does this paper attempt to address?

This paper targets three problems that existing models face when generating symbolic music:

1. **Ignoring the continuity of note frequencies**: Most existing music generation models are language-based and overlook the frequency continuity of notes. As a result, they fit the distribution of rare or never-used notes poorly, which reduces the diversity of generated samples.
2. **Difficulty in estimating the distribution of low-density music symbols**: Because music symbols are low-density (that is, most notes rarely appear in actual music), estimating the distribution of these notes within a high-density solution space is very challenging.
3. **Lack of joint modeling of music structure and semantic information**: Existing methods usually focus only on the notes themselves and ignore the structure and semantic information of the music, which leaves the generated music lacking in diversity and structure.

To address these problems, the authors propose the **Music-Diff architecture**, which improves on existing music generation models in the following ways:

- **Multivariate perturbation with a joint probability diffusion model**: Notes and semantic information are both perturbed, and a joint probability diffusion model generates symbolic music from them.
- **An enhanced fragmentation module**: Event-based notation and the Structural Similarity Index (SSIM) are used to prevent boundary blurring, so that musical structure elements are extracted more precisely.
- **Joint semantic pre-training**: A joint distribution over notes, chords, and paragraphs is constructed, which avoids modeling low-density notes directly.
- **A multi-branch denoiser**: Pareto optimization is used to recover the perturbed notes across multiple noise objectives, with the aim of improving the diversity and structure of the generated samples.

Overall, the paper addresses the limited diversity and structure of existing symbolic music generation models by proposing the Music-Diff architecture, and it reports progress in particular on never-used notes.
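The summary mentions that the multi-branch denoiser balances several noise objectives via Pareto optimization, but the source does not say which algorithm is used. As one possible illustration, the sketch below uses the two-objective min-norm rule from the multiple-gradient descent algorithm (MGDA), a common Pareto-style method, swapped in here as an assumption. The names `g_note` and `g_semantic` are hypothetical stand-ins for the gradients of a note-level and a semantic-level denoising loss.

```python
import numpy as np

def min_norm_weight(g1, g2):
    """Weight a in [0, 1] minimizing ||a * g1 + (1 - a) * g2||^2."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:
        return 0.5  # identical gradients: any weight gives the same result
    alpha = float((g2 - g1) @ g2) / denom
    return min(1.0, max(0.0, alpha))

def combine_gradients(g1, g2):
    """Return the min-norm convex combination of two gradients and its weight."""
    alpha = min_norm_weight(g1, g2)
    return alpha * g1 + (1.0 - alpha) * g2, alpha

# Toy gradients for two objectives (illustrative values only).
g_note = np.array([1.0, 0.0])
g_semantic = np.array([0.0, 1.0])
direction, alpha = combine_gradients(g_note, g_semantic)
```

For two orthogonal gradients of equal length the weight is 0.5, so both objectives contribute equally. When one gradient is a scaled-up copy of the other, the rule falls back to the shorter gradient, which keeps a single objective from dominating the update.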

Why Perturbing Symbolic Music is Necessary: Fitting the Distribution of Never-used Notes through a Joint Probabilistic Diffusion Model
