Why Perturbing Symbolic Music is Necessary: Fitting the Distribution of Never-used Notes through a Joint Probabilistic Diffusion Model

Shipei Liu, Xiaoya Fan, Guowei Wu
2024-08-04
Abstract: Existing music generation models are mostly language-based, neglecting the frequency continuity property of notes, resulting in inadequate fitting of rare or never-used notes and thus reducing the diversity of generated samples. We argue that the distribution of notes can be modeled by translational invariance and periodicity, especially using diffusion models to generalize notes by injecting frequency-domain Gaussian noise. However, due to the low-density nature of music symbols, estimating the distribution of notes latent in the high-density solution space poses significant challenges. To address this problem, we introduce the Music-Diff architecture, which fits a joint distribution of notes and accompanying semantic information to generate symbolic music conditionally. We first enhance the fragmentation module for extracting semantics by using event-based notations and the structural similarity index, thereby preventing boundary blurring. As a prerequisite for multivariate perturbation, we introduce a joint pre-training method to construct the progressions between notes and musical semantics while avoiding direct modeling of low-density notes. Finally, we recover the perturbed notes by a multi-branch denoiser that fits multiple noise objectives via Pareto optimization. Our experiments suggest that, in contrast to language models, joint probability diffusion models perturbing at both note and semantic levels can provide more sample diversity and compositional regularity. The case study highlights the rhythmic advantages of our model over language-based and DDPM-based models by analyzing the hierarchical structure expressed in the self-similarity metrics.
Sound, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper targets three problems that existing models face when generating symbolic music:

1. **Ignoring the continuity of note frequencies**: Most existing music generation models are language-based and overlook the frequency continuity of notes. As a result, they fit the distribution of rare or never-used notes poorly, which reduces the diversity of generated samples.
2. **Difficulty in estimating the distribution of low-density music symbols**: Music symbols are low-density (that is, most notes rarely appear in actual music), so estimating the distribution of these notes within a high-density solution space is very challenging.
3. **Lack of joint modeling of music structure and semantic information**: Existing methods usually focus only on the notes themselves and ignore the structure and semantic information of the music, which leaves the generated music lacking in both diversity and structure.

To address these problems, the authors propose the **Music-Diff architecture**, which improves on existing music generation models in the following ways:

- **Multivariate perturbation with a joint probability diffusion model**: Perturbations are applied at both the note level and the semantic level, and a joint probability diffusion model generates symbolic music conditionally.
- **Enhanced fragmentation module**: Event-based notation and the Structural Similarity Index (SSIM) are used to prevent boundary blurring, so that structural elements of the music are extracted more precisely.
- **Joint semantic pre-training**: A joint distribution over notes, chords, and paragraphs is constructed, which avoids modeling low-density notes directly.
- **Multi-branch denoiser**: Perturbed notes are recovered by a denoiser that fits multiple noise objectives via Pareto optimization.

Overall, the paper addresses the limited diversity and structure of existing symbolic music generation models through the Music-Diff architecture, with particular attention to never-used notes. The authors' experiments suggest that it provides more sample diversity and compositional regularity than language models.
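The intuition that Gaussian perturbation lets a model reach never-used notes can be illustrated with a minimal sketch. This is not the Music-Diff implementation: it applies the standard DDPM forward step, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, to MIDI pitch numbers, which are treated here as a continuous log-frequency axis. The function names, the linear noise schedule, the normalization constants, and the toy three-pitch corpus are all assumptions made for illustration.

```python
import numpy as np

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def midi_to_hz(midi_pitch):
    """Equal-temperament frequency of a MIDI pitch (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((np.asarray(midi_pitch, dtype=float) - 69.0) / 12.0)

def perturb_pitches(pitches, t, alpha_bar, rng, center=60.0, scale=12.0):
    """DDPM forward step on pitches, normalized to roughly unit scale.

    A semitone index is linear in log-frequency, so Gaussian noise here
    is Gaussian noise along the log-frequency axis (illustrative choice).
    """
    x0 = (np.asarray(pitches, dtype=float) - center) / scale
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt * scale + center

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar(1000)

# Toy corpus that only ever uses C4, E4 and G4 (hypothetical data).
seen = {60, 64, 67}
corpus = rng.choice(sorted(seen), size=5000)

# Perturb at an early timestep, then snap back to valid MIDI pitches.
noisy = perturb_pitches(corpus, t=50, alpha_bar=alpha_bar, rng=rng)
snapped = np.clip(np.rint(noisy), 0, 127).astype(int)

# Pitches that never occur in the corpus but appear after perturbation.
unseen = set(snapped.tolist()) - seen
print("never-used pitches reached:", sorted(unseen))
```

A token-level language model trained on this corpus would assign no observed evidence to the neighbouring pitches, whereas the perturbed samples place mass on them because nearby pitches are nearby in frequency. A denoiser trained on such perturbed data therefore sees, and can learn to resolve, notes that the raw corpus never contains.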