The Power of Fragmentation: A Hierarchical Transformer Model for Structural Segmentation in Symbolic Music Generation

Guowei Wu, Shipei Liu, Xiaoya Fan
DOI: https://doi.org/10.1109/taslp.2023.3263797
2023-01-01
Abstract: Symbolic music generation relies on the contextual representation capabilities of the generative model, and the most prevalent approach is the Transformer-based model. Learning contextual representations is also related to the structural elements of music, such as intro, verse, and chorus, which have received little attention in the scientific literature. In this paper, we propose a hierarchical Transformer model to learn multiscale contexts in music. In the encoding phase, we first design a fragment scope localization module to separate the music parts into chords and sections. Then, we use a multiscale attention mechanism to learn note-, chord-, and section-level contexts. In the decoding phase, we propose a hierarchical Transformer model that uses fine decoders to generate sections in parallel and a coarse decoder to decode the combined music. We also design a music style normalization layer to achieve a consistent music style across the generated sections. Our model is evaluated on two open MIDI datasets. Experiments show that our model outperforms the comparison models on 50% (6 of 12) and 83.3% (10 of 12) of the quantitative metrics for short- and long-term music generation, respectively. Preliminary visual analysis also suggests its potential to follow compositional rules, such as the reuse of rhythmic patterns and critical melodies, which are associated with improved music quality.
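The abstract does not say how the multiscale attention is implemented. As a minimal sketch, assume the fragment scope localization step has already assigned every token a chord ID and a section ID. The three context scales can then be expressed as attention masks: note-level attention is assumed here to be ordinary causal attention over all tokens, while chord- and section-level attention only let a token attend to earlier tokens in the same chord or section. The function names (`segment_mask`, `multiscale_masks`, `masked_attention_weights`) are hypothetical and not taken from the paper.

```python
import math
from typing import Dict, List, Sequence

Mask = List[List[bool]]


def causal_mask(n: int, causal: bool = True) -> Mask:
    """Note-level scale (assumed): token i may see every token j <= i."""
    return [[(not causal) or j <= i for j in range(n)] for i in range(n)]


def segment_mask(segment_ids: Sequence[int], causal: bool = True) -> Mask:
    """mask[i][j] is True when token i may attend to token j, i.e. both
    tokens share a segment ID (chord or section) and, if causal, j <= i."""
    n = len(segment_ids)
    return [
        [segment_ids[i] == segment_ids[j] and ((not causal) or j <= i) for j in range(n)]
        for i in range(n)
    ]


def multiscale_masks(chord_ids: Sequence[int], section_ids: Sequence[int],
                     causal: bool = True) -> Dict[str, Mask]:
    """Build one mask per context scale (hypothetical helper)."""
    if len(chord_ids) != len(section_ids):
        raise ValueError("chord_ids and section_ids must have the same length")
    return {
        "note": causal_mask(len(chord_ids), causal),
        "chord": segment_mask(chord_ids, causal),
        "section": segment_mask(section_ids, causal),
    }


def masked_attention_weights(scores: List[List[float]], mask: Mask) -> List[List[float]]:
    """Row-wise softmax over raw attention scores; masked entries get weight 0.
    Every row keeps at least its diagonal, so the softmax is always defined."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        peak = max(s for s, m in zip(row_scores, row_mask) if m)  # for numerical stability
        exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Usage: six tokens, three two-token chords, two sections.
chord_ids = [0, 0, 1, 1, 2, 2]
section_ids = [0, 0, 0, 0, 1, 1]
masks = multiscale_masks(chord_ids, section_ids)
uniform = [[0.0] * 6 for _ in range(6)]
chord_weights = masked_attention_weights(uniform, masks["chord"])
print(chord_weights[3])  # token 3 attends only to tokens 2 and 3 of its chord
```

Block-diagonal masks are one common way to impose segment-level structure on a Transformer; the per-scale outputs would then be combined (for example, summed or concatenated), but that combination step, like the masks themselves, is an assumption rather than the paper's stated design.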
engineering, electrical & electronic, acoustics