SymPAC: Scalable Symbolic Music Generation With Prompts And Constraints

Haonan Chen, Jordan B. L. Smith, Janne Spijkervet, Ju-Chiang Wang, Pei Zou, Bochen Li, Qiuqiang Kong, Xingjian Du
2024-09-10
Abstract: Progress in the task of symbolic music generation may be lagging behind other tasks like audio and text generation, in part because of the scarcity of symbolic training data. In this paper, we leverage the greater scale of audio music data by applying pre-trained MIR models (for transcription, beat tracking, structure analysis, etc.) to extract symbolic events and encode them into token sequences. To the best of our knowledge, this work is the first to demonstrate the feasibility of training symbolic generation models solely from auto-transcribed audio data. Furthermore, to enhance the controllability of the trained model, we introduce SymPAC (Symbolic Music Language Model with Prompting And Constrained Generation), which is distinguished by using (a) prompt bars in encoding and (b) a technique called Constrained Generation via Finite State Machines (FSMs) during inference time. We show the flexibility and controllability of this approach, which may be critical in making music AI useful to creators and users.
Sound, Audio and Speech Processing
### What problems does this paper attempt to solve?

This paper aims to solve two main problems in the field of symbolic music generation:

1. **Data scarcity problem**:
   - The progress of symbolic music generation tasks may lag behind audio and text generation tasks, partly due to the scarcity of symbolic training data. Compared with audio and text data, high-quality symbolic music datasets are smaller and more difficult to obtain.
   - The paper proposes a new method. By using pre-trained Music Information Retrieval (MIR) models (for transcription, beat tracking, structure analysis, etc.), it extracts symbolic events from large-scale audio music data and encodes them into token sequences. In this way, abundant audio data can be used to train symbolic music generation models without relying on manually annotated symbolic music data.
2. **Controllability problem**:
   - In symbolic music generation, how to integrate user input to control the generation results is an important research topic. Traditional generation models usually lack fine-grained control over the generation process, which is crucial for creators.
   - The paper introduces the SymPAC framework (Symbolic Music Language Model with Prompting And Constrained Generation), which enhances the controllability of the model in the following two ways:
     - **Prompt Bars**: Before encoding the actual notes, all control signals are integrated into a single prompt section. This gives the decoder-only language model the full context of the control signals when generating music.
     - **Constrained Generation via Finite State Machines (FSMs)**: At the inference stage, FSMs are used to constrain the token sampling at each time step, ensuring that the generated tokens not only conform to the encoding grammar but also follow the user input.
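
The FSM-constrained sampling idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the toy vocabulary, the grammar in `TRANSITIONS`, the `user_pitches` constraint, and the `dummy_logits` stand-in for a trained language model are all hypothetical, chosen only to show how an FSM can restrict which tokens may be sampled at each step.

```python
import math
import random

# Hypothetical toy vocabulary; the paper's real token set is not given here.
VOCAB = ["BAR", "POS_0", "POS_8", "PITCH_60", "PITCH_64", "PITCH_67", "EOS"]

# Hypothetical grammar: state -> {token type -> next state}.
# A sequence starts with BAR; each note is a POS token followed by a PITCH token.
TRANSITIONS = {
    "start": {"BAR": "bar"},
    "bar": {"POS": "pos", "BAR": "bar", "EOS": "end"},
    "pos": {"PITCH": "pitch"},
    "pitch": {"POS": "pos", "BAR": "bar", "EOS": "end"},
}


def token_type(tok):
    """Return the token's type, e.g. 'PITCH' for 'PITCH_60'."""
    return tok.split("_")[0]


def allowed_tokens(state, user_pitches=None):
    """Tokens permitted by the grammar in `state` and by the user constraint."""
    allowed = []
    for tok in VOCAB:
        kind = token_type(tok)
        if kind not in TRANSITIONS.get(state, {}):
            continue  # violates the encoding grammar
        if user_pitches is not None and kind == "PITCH" and tok not in user_pitches:
            continue  # violates the (assumed) user input
        allowed.append(tok)
    return allowed


def constrained_sample(logits, allowed, rng):
    """Softmax-sample over allowed tokens only; all others get zero probability."""
    if not allowed:
        raise ValueError("constraints leave no valid token")
    top = max(logits[t] for t in allowed)
    weights = [math.exp(logits[t] - top) for t in allowed]
    return rng.choices(allowed, weights=weights, k=1)[0]


def generate(model_logits, max_len=32, user_pitches=None, seed=0):
    """Sample tokens one at a time, advancing the FSM after each token."""
    rng = random.Random(seed)
    state, seq = "start", []
    while state != "end" and len(seq) < max_len:
        allowed = allowed_tokens(state, user_pitches)
        tok = constrained_sample(model_logits(seq), allowed, rng)
        seq.append(tok)
        state = TRANSITIONS[state][token_type(tok)]
    return seq


def dummy_logits(seq):
    """Stand-in for a trained model: the same score for every token."""
    return {tok: 0.0 for tok in VOCAB}


if __name__ == "__main__":
    print(generate(dummy_logits, user_pitches={"PITCH_60", "PITCH_67"}))
```

Because disallowed tokens are removed before sampling, every generated sequence is grammatical under this toy FSM, and `PITCH_64` never appears when the user restricts pitches to `PITCH_60` and `PITCH_67`. A sequence cut off by `max_len` may end without `EOS`.
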
### Summary

The main contributions of this paper include:

- **Scalability**: It shows that high-quality multi-track symbolic music generation models can be trained using only auto-transcribed audio data, without the need for manually annotated symbolic music data, and that training can be scaled up by accumulating more audio data.
- **Controllability**: The SymPAC framework is proposed, enabling users to flexibly supply control signals to a decoder-only language model while maintaining good generation quality.

These improvements make symbolic music generation more practical and better suited to the needs of creators and users.