PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation

Yungang Yi, Weihua Li, Matthew Kuo, Quan Bai
2024-11-13
Abstract: Music generation has progressed significantly, especially in the domain of audio generation. However, generating symbolic music that is both long-structured and expressive remains a significant challenge. In this paper, we propose PerceiverS (Segmentation and Scale), a novel architecture designed to address this issue by leveraging both Effective Segmentation and Multi-Scale attention mechanisms. Our approach enhances symbolic music generation by simultaneously learning long-term structural dependencies and short-term expressive details. By combining cross-attention and self-attention in a Multi-Scale setting, PerceiverS captures long-range musical structure while preserving performance nuances. The proposed model, evaluated on datasets like Maestro, demonstrates improvements in generating coherent and diverse music with both structural consistency and expressive variation. The project demos and the generated music samples can be accessed through the link: https://perceivers.github.io.
Artificial Intelligence, Multimedia, Sound
What problem does this paper attempt to address?
The paper addresses the problem of maintaining long-term structural consistency and expressive diversity at the same time in symbolic music generation. Although audio generation has advanced significantly, generating symbolic music that is both long-structured and expressive remains a major challenge. The paper proposes a new architecture, PerceiverS (Segmentation and Scale), which tackles this problem through effective segmentation and multi-scale attention mechanisms. By combining cross-attention and self-attention in a multi-scale setting, the model learns long-term structural dependencies and short-term expressive details simultaneously.

### Main contributions of the paper:

1. **Effective segmentation**: Improves the pre-processing strategy for the input sequence and overcomes the limitation of the Perceiver AR model where the causal mask covers the entire input sequence. By randomly selecting the cropping endpoint rather than always taking the longest possible input sequence, it ensures that the model can learn from the initial part of the sequence, which improves generation quality.
2. **Multi-scale cross-attention**: Introduces a multi-layer attention mechanism in which different layers use different attention lengths, balancing the focus on long-range and short-range context, thereby enhancing diversity and reducing repetitiveness.

### Specific technical details:

- **Input sequence pre-processing**:
  - Let the complete sequence be \( X=\{x_{1},x_{2},\ldots,x_{l}\} \), where \( l \) is the total length of the entire sequence, \( m \) is the maximum input length that the model can attend to at a time, and \( n \) is the query length.
  - In the traditional Transformer method, a segment of length \( m \) is randomly extracted from the sequence, for example:
    \[ \hat{X}=\{x_{s},x_{s+1},\ldots,x_{s+m-1}\} \]
  - PerceiverS adopts an improved method. It randomly selects a cropping endpoint \( j \) from the range \( (n+1,\, l+1) \) and then takes a segment of at most \( m \) tokens leading up to this endpoint, ensuring that the model can learn from the initial part of the sequence.
- **Multi-scale cross-attention mechanism**:
  - By introducing masks of different scales in the multi-layer cross-attention, the model performs cross-attention at multiple scales simultaneously.
  - Specifically, one layer does not use a scale mask, and another layer masks out the first \( m-n \) tokens.
  - The cross-attention layers are combined in a cascade: the output of each layer is used directly as the input of the next layer, gradually refining and constructing the output.

### Experimental results:

- In experiments on the Maestro dataset, PerceiverS generates coherent and diverse music, with gains in both structural consistency and expressive diversity.
- The results show an average improvement of 40% over Perceiver AR on the overlap area metric, indicating a clear advantage in generating high-quality symbolic music.

### Conclusion:

PerceiverS addresses the challenge of generating symbolic music with both long-term structural consistency and diversity through effective segmentation and multi-scale attention mechanisms, providing a new approach for the field of symbolic music generation.
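
The cropping strategy described under the input sequence pre-processing details above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `crop_random_start` and `crop_random_end` are hypothetical, and the sketch assumes that the endpoint \( j \) is an exclusive end index drawn uniformly from the half-open range \( n+1 \) to \( l+1 \), with the segment taken from the tokens that precede it.

```python
import random


def crop_random_start(seq, m, rng=random):
    """Baseline crop: a window of exactly m tokens starting at a random index s.

    Assumes len(seq) >= m.
    """
    s = rng.randrange(0, len(seq) - m + 1)
    return seq[s:s + m]


def crop_random_end(seq, m, n, rng=random):
    """Endpoint-based crop: pick an end index j, keep at most m tokens before it.

    Assumption (not confirmed by the source): j is an exclusive end index
    drawn uniformly from n + 1 .. l, i.e. range(n + 1, l + 1).
    Assumes len(seq) >= n + 1.
    """
    l = len(seq)
    j = rng.randrange(n + 1, l + 1)
    start = max(0, j - m)  # clamp so early endpoints yield shorter segments
    return seq[start:j]


if __name__ == "__main__":
    tokens = list(range(20))          # stand-in for a tokenized piece
    rng = random.Random(0)
    for _ in range(5):
        segment = crop_random_end(tokens, m=8, n=4, rng=rng)
        print(len(segment), segment)
```

Under these assumptions, any endpoint with \( j < m \) produces a segment that starts at the first token and is shorter than \( m \), so the beginning of a piece shows up in training as a short context. A fixed-length window, by contrast, always supplies a full \( m \) tokens of context.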
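
The two mask scales described in the multi-scale cross-attention details above can be illustrated with plain boolean matrices. This is a sketch under stated assumptions rather than the paper's code: it assumes the \( n \) queries correspond to the last \( n \) of the \( m \) input positions and that each query may attend only to keys at or before its own position. The helper names `causal_cross_mask`, `scale_mask`, `combine`, and `render` are hypothetical. The cascade that feeds each layer's output into the next is not modeled; only the masks are shown.

```python
def causal_cross_mask(m, n):
    """mask[i][k] is True when query i may attend to key k.

    Assumption: query i sits at absolute position (m - n) + i and may see
    every key up to and including that position.
    """
    return [[k <= (m - n) + i for k in range(m)] for i in range(n)]


def scale_mask(m, n, hidden_prefix):
    """True for keys that stay visible after hiding the first hidden_prefix keys."""
    return [[k >= hidden_prefix for k in range(m)] for _ in range(n)]


def combine(a, b):
    """Element-wise AND of two masks with the same shape."""
    return [[x and y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def render(mask):
    """Render each mask row as a string of 1s (visible) and 0s (masked)."""
    return ["".join("1" if v else "0" for v in row) for row in mask]


m, n = 8, 3
# Layer without a scale mask: full causal view of the m-token context.
long_range = causal_cross_mask(m, n)
# Layer that masks out the first m - n tokens: short-range view only.
short_range = combine(long_range, scale_mask(m, n, m - n))

print(render(long_range))   # ['11111100', '11111110', '11111111']
print(render(short_range))  # ['00000100', '00000110', '00000111']
```

With \( m=8 \) and \( n=3 \), the unmasked layer lets each query see the whole preceding context, while the layer that hides the first \( m-n=5 \) tokens restricts each query to the most recent positions.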