Symbolic Music Generation with Fine-grained Interactive Textural Guidance

Tingyu Zhu, Haoyu Liu, Zhimin Jiang, Zeyu Zheng
2024-10-11
Abstract: The problem of symbolic music generation presents unique challenges due to the combination of limited data availability and the need for high precision in note pitch. To overcome these difficulties, we introduce Fine-grained Textural Guidance (FTG) within diffusion models to correct errors in the learned distributions. By incorporating FTG, the diffusion models improve the accuracy of music generation, which makes them well-suited for advanced tasks such as progressive music generation, improvisation and interactive music creation. We derive theoretical characterizations for both the challenges in symbolic music generation and the effect of the FTG approach. We provide numerical experiments and a demo page for interactive music generation with user input to showcase the effectiveness of our approach.
Sound, Artificial Intelligence, Machine Learning, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
The paper targets two key challenges in symbolic music generation:

1. **Harmonic Precision**:
   - Symbolic music generation requires very high precision, especially in pitch. In image generation, an error in a single pixel may not noticeably affect overall quality; in symbolic music, a single wrong note can be obvious even to untrained listeners.
   - Through theoretical analysis and empirical observation, the paper explains why existing generative models are prone to producing "wrong notes": the model fails to estimate the probability density correctly, so the generated notes do not conform to the current mode or harmonic context.
2. **Rhythmic Regularity**:
   - Existing symbolic music generation models tend to produce irregular rhythmic patterns. Human composers usually maintain a consistent rhythmic pattern across consecutive bars, whereas the accompaniment parts produced by generative models often lack this consistency.
   - The cause is the scarcity and high dimensionality of the data, which make it difficult for the model to capture correlations between different bars. Existing MIDI datasets also contain irregular samples, which exacerbates the problem.

To address these challenges, the paper proposes Fine-grained Textural Guidance (FTG), which improves the quality and stability of generated music by introducing fine-grained harmony and rhythm conditions into the diffusion model. The method has two parts:

- **Fine-grained Conditioning in Training**:
  - A conditional diffusion model is given harmony (C) and rhythm (R) conditions, which are fed to the model in the form of a piano roll (Mcond).
  - Following the idea of classifier-free guidance, the rhythm condition is randomly applied or withheld during training to balance output stability against sample diversity.
- **Fine-grained Control in Sampling Process**:
  - During sampling, the predicted noise is adjusted so that the generated notes conform to the current mode and harmonic context, which avoids "wrong notes".
  - Concretely, at each sampling step the predicted noise is projected onto the domain that satisfies the mode constraints, correcting errors as they arise during generation.

Through these methods, the paper aims to improve the precision and regularity of symbolic music generation, making it better suited to advanced tasks such as progressive music generation, improvisation, and interactive music creation.
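
The two mechanisms described above (randomly withholding the rhythm condition in training, and projecting the predicted noise at each sampling step) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions that do not come from the summary: a standard DDPM parametrization in which the implied clean sample is `(x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)`, a piano roll indexed as `[pitch][time]` with the row index equal to the MIDI pitch number, a note counted as "on" when its value exceeds a threshold of 0.5, a single key (C major) for the whole excerpt, and an all-zero roll standing for "no rhythm condition". All function names are hypothetical.

```python
import math
import random

# Pitch classes of C major (0 = C, 1 = C#, ..., 11 = B); an illustrative choice.
C_MAJOR = {0, 2, 4, 5, 7, 9, 11}


def drop_rhythm_condition(rhythm, p_drop, rng):
    """With probability p_drop, replace the rhythm condition by an all-zero roll.

    Assumption: an all-zero roll represents "no rhythm condition".
    """
    if rng.random() < p_drop:
        return [[0.0] * len(row) for row in rhythm]
    return [list(row) for row in rhythm]


def out_of_key_mask(num_pitches, allowed_pitch_classes):
    """Return True for every pitch row whose pitch class lies outside the key."""
    return [(p % 12) not in allowed_pitch_classes for p in range(num_pitches)]


def implied_x0(x_t, eps, alpha_bar_t):
    """Clean-sample estimate implied by a noise prediction (assumed DDPM form)."""
    sqrt_ab = math.sqrt(alpha_bar_t)
    sqrt_1m = math.sqrt(1.0 - alpha_bar_t)
    return [
        [(x - sqrt_1m * e) / sqrt_ab for x, e in zip(row_x, row_e)]
        for row_x, row_e in zip(x_t, eps)
    ]


def project_noise(x_t, eps_pred, alpha_bar_t, mask, threshold=0.5):
    """Project predicted noise so that no out-of-key entry is implied to be "on".

    Requiring implied_x0 <= threshold on a masked entry is equivalent to
        eps >= (x_t - sqrt(alpha_bar_t) * threshold) / sqrt(1 - alpha_bar_t),
    so the Euclidean projection is an elementwise max with that lower bound.
    """
    if not 0.0 < alpha_bar_t < 1.0:
        raise ValueError("alpha_bar_t must lie strictly between 0 and 1")
    sqrt_ab = math.sqrt(alpha_bar_t)
    sqrt_1m = math.sqrt(1.0 - alpha_bar_t)
    projected = []
    for row_x, row_e, is_out in zip(x_t, eps_pred, mask):
        if not is_out:
            # In-key rows are left exactly as the model predicted them.
            projected.append(list(row_e))
            continue
        projected.append(
            [max(e, (x - sqrt_ab * threshold) / sqrt_1m) for x, e in zip(row_x, row_e)]
        )
    return projected
```

Because the constraint is a separate bound on each out-of-key entry, the projection changes only those entries whose implied value would exceed the threshold and leaves every in-key entry as predicted. A per-bar or per-chord mask could replace the single-key mask by making `mask` depend on the time index as well.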