Nicolas Jonason, Luca Casini, Bob L.T. Sturm
Abstract: We present a new approach for fast and controllable generation of symbolic music based on simplex diffusion, which is essentially a diffusion process operating on probabilities rather than the signal space. This objective has been applied in domains such as natural language processing, but here we apply it to generating 4-bar multi-instrument music loops using an orderless representation. We show that our model can be steered with vocabulary priors, which affords a considerable level of control over the music generation process, for instance, infilling in time and pitch and choice of instrumentation -- all without task-specific model adaptation or applying extrinsic control.
What problem does this paper attempt to address?
The paper addresses how to achieve fast and controllable symbolic music generation. Specifically, the authors propose a new method based on Simplex Diffusion (SD) for generating 4-bar multi-instrument MIDI loops. The method makes the generation process highly controllable by running the diffusion process on probability distributions rather than directly in the signal space.
### Main problems and solutions
1. **Limitations of existing methods**:
- Conventional diffusion models for symbolic music usually perform diffusion in the signal domain or an embedding space, which makes the generation process harder to control.
- Existing symbolic music generation methods offer little flexible control over the generated content (such as timing, pitch, or instrument selection) without fine-tuning for specific tasks.
2. **Proposed solutions**:
- **Simplex Diffusion (SD)**: This method applies the diffusion process to the probability distribution rather than directly to the signal itself. This makes the diffusion process continuous even on discrete signals, thus simplifying the implementation of external control.
- **Vocabulary Priors**: Vocabulary priors steer generation at inference time. For example, the time, pitch, or instrument of certain notes can be specified, giving precise control over the output.
- **Orderless Representation**: Representing the music as an unordered set of notes allows any note property to be regenerated without violating the syntax of the representation.
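To make the idea of a vocabulary prior concrete, the following is a minimal sketch, assuming a prior is a non-negative weight vector over the vocabulary of one note attribute. The toy pitch-class vocabulary and the helper `make_prior` are hypothetical and are not taken from the paper; the actual vocabulary and prior format used by SYMPLEX may differ.

```python
import numpy as np

# Hypothetical toy vocabulary for a single note attribute (pitch class).
# The paper's real vocabulary is not specified in this summary.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def make_prior(allowed):
    """Return a binary prior over PITCH_CLASSES: 1.0 for allowed tokens, else 0.0."""
    allowed = set(allowed)
    return np.array([1.0 if tok in allowed else 0.0 for tok in PITCH_CLASSES])


# Unconstrained attribute: every token is allowed.
free_prior = make_prior(PITCH_CLASSES)

# Soft constraint: restrict the attribute to the C major scale.
c_major_prior = make_prior(["C", "D", "E", "F", "G", "A", "B"])

# Hard constraint: a one-hot prior pins the attribute to a single value.
fixed_prior = make_prior(["G"])
```

Under this assumption, the same mechanism covers the whole range from unconditional generation (all ones) to fully specified attributes (one-hot), which is why no task-specific model is needed.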
### Specific contributions
1. **SYMPLEX model**: Proposes SYMPLEX, a model based on Simplex Diffusion for generating 4-bar multi-instrument MIDI loops. To the authors' knowledge, this is the first application of Simplex Diffusion to symbolic music generation.
2. **Controllable generation**: Demonstrates how vocabulary priors control the generation process across different music generation tasks, such as time-pitch infilling, instrument conditioning, and rhythm and tonality control.
3. **Improved loop extraction technique**: Adapts and extends a context-based loop extraction technique, combining it with metric-structure heuristics to obtain better music loops.
### Method overview
- **Training process**: The neural network θ learns to recover data samples from noisy probabilities. Each training step generates initial logits, adds noise, applies softmax to obtain a probability distribution, and updates the network with a cross-entropy loss.
- **Inference process**: Start from randomly initialized probabilities and iteratively refine these probabilities to finally generate new samples.
- **Vocabulary prior injection**: Control is achieved by multiplying the current probabilities by the vocabulary prior, normalizing the result, and feeding it into the neural network.
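The noising and prior-injection steps listed above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the ±`logit_scale` logit construction, and the single `noise_scale` parameter are hypothetical, and the network update with cross-entropy loss is omitted.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def noisy_probs(token_ids, vocab_size, noise_scale, rng, logit_scale=5.0):
    """Turn discrete tokens into noisy probabilities on the simplex.

    Assumption: initial logits are +logit_scale for the true token and
    -logit_scale elsewhere, with Gaussian noise added before the softmax.
    """
    one_hot = np.eye(vocab_size)[token_ids]          # (n, vocab_size)
    logits = logit_scale * (2.0 * one_hot - 1.0)     # initial logits
    noisy = logits + noise_scale * rng.standard_normal(logits.shape)
    return softmax(noisy)                            # rows sum to 1


def inject_prior(probs, prior, eps=1e-12):
    """Multiply probabilities by a vocabulary prior and renormalize."""
    weighted = probs * prior
    return weighted / (weighted.sum(axis=-1, keepdims=True) + eps)


rng = np.random.default_rng(0)
probs = noisy_probs(np.array([1, 3]), vocab_size=4, noise_scale=1.0, rng=rng)
prior = np.array([1.0, 1.0, 0.0, 0.0])   # forbid the last two tokens
steered = inject_prior(probs, prior)     # this would be the network input
```

Because the prior only reweights the probabilities that are passed to the network, tokens with zero prior weight receive zero probability in the input, while the network itself is unchanged.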
### Experiments and applications
- **Dataset**: The model is trained on 430k multi-track MIDI files from the MetaMIDI dataset, from which approximately 250,000 4-bar MIDI loops are extracted after processing.
- **Generation tasks**: The paper demonstrates multiple generation tasks, including unconditional generation, conditional generation (such as specifying instrument or pitch constraints), and editing tasks (such as infilling a selected time-pitch region, generating variations, and replacing the bass).
### Future work
- **Automation of parameter adjustment**: Currently, different tasks require manual adjustment of the number of generation steps T and the top-p threshold. Future work will focus on automatically optimizing these parameters to improve generation efficiency and user experience.
Through these innovations, the SYMPLEX model provides a new, efficient, and controllable tool for symbolic music generation.