SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation

Shuangrui Ding, Zihan Liu, Xiaoyi Dong, Pan Zhang, Rui Qian, Conghui He, Dahua Lin, Jiaqi Wang
2024-02-28
Abstract: We present SongComposer, an innovative LLM designed for song composition. It could understand and generate melodies and lyrics in symbolic song representations, by leveraging the capability of LLM. Existing music-related LLM treated the music as quantized audio signals, while such implicit encoding leads to inefficient encoding and poor flexibility. In contrast, we resort to symbolic song representation, the mature and efficient way humans designed for music, and enable LLM to explicitly compose songs like humans. In practice, we design a novel tuple design to format lyric and three note attributes (pitch, duration, and rest duration) in the melody, which guarantees the correct LLM understanding of musical symbols and realizes precise alignment between lyrics and melody. To impart basic music understanding to LLM, we carefully collected SongCompose-PT, a large-scale song pretraining dataset that includes lyrics, melodies, and paired lyrics-melodies in either Chinese or English. After adequate pre-training, 10K carefully crafted QA pairs are used to empower the LLM with the instruction-following capability and solve diverse tasks. With extensive experiments, SongComposer demonstrates superior performance in lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation, outperforming advanced LLMs like GPT-4.
Sound, Artificial Intelligence, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses how to use large language models (LLMs) to compose lyrics and melodies, and in particular how to generate melodies that are both harmonious and aligned with the lyrics by working with symbolic song representations. It proposes SongComposer, an LLM designed to understand and generate symbolic song representations covering both melodies and lyrics. Existing music-related LLMs usually treat music as quantized audio signals, which leads to inefficient encoding and poor flexibility. SongComposer instead adopts symbolic song representation, the mature and efficient notation humans designed for music, so that an LLM can compose songs explicitly, as a human would.

### Main Challenges and Solutions:

1. **How to make large language models learn music**:
   - Musical symbols are abstract and are not used the way ordinary letters are. For example, "F4" in a melody means "the F note in the fourth octave", not a paper size or a meaningless combination of characters.
   - To handle this, the authors decompose the melody into triples of three attributes (pitch, duration, and rest duration) and add them as tokens in a new vocabulary. This preserves the actual meaning of the musical symbols, rather than reusing tokens from the original vocabulary or relying on an abstract representation produced by a separate audio encoder.
   - For paired lyric and melody data, the authors align lyrics and melody at the word level, forming a sequence in which each lyric word is followed by its corresponding musical attributes.
2. **What music knowledge do large language models need to learn?**:
   - The model needs to acquire basic music understanding from large-scale song data, including melodies, lyrics, and the precise alignment between them.
   - To this end, the authors compile SongCompose-PT, a comprehensive pre-training dataset containing 280,000 lyric-only songs, 20,000 melody-only pieces, and 15,000 paired lyric-melody songs, covering Chinese and English.
3. **How to make large language models create songs according to instructions?**:
   - After sufficient pre-training, the model can understand and continue songs, but it still struggles to follow flexible instructions.
   - To overcome this, the authors design 10,000 question-and-answer style dialogues, enabling the model to solve a wide range of song-generation tasks.

### Main Contributions:

- **Proposing SongComposer**: An LLM designed specifically for song composition, which generates melodies and lyrics in symbolic song representations and offers better token efficiency, precise representation, a flexible format, and human-readable output.
- **Compiling the SongCompose-PT dataset**: A comprehensive pre-training dataset containing lyrics, melodies, and paired lyrics-melodies in Chinese and English.
- **Experimental Results**: Extensive experiments show that SongComposer outperforms advanced LLMs such as GPT-4 on lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation.

### Summary:

The paper tackles the application of large language models to music composition by introducing a symbolic song representation and carefully curated datasets, and it reports significant progress in the joint generation of lyrics and melodies.
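
To make the word-level alignment described under the first challenge more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `AlignedNote` class, the `format_song` helper, the `|word,<pitch>,<duration>,<rest>|` serialization, and the use of seconds as the time unit are all hypothetical. The only grounded parts are that each lyric word is paired with three note attributes (pitch, duration, rest duration) and that "F4" names the F in the fourth octave, which is MIDI note 65 under the common convention where C4 is 60.

```python
from dataclasses import dataclass

# Semitone offsets of the natural notes within one octave.
_SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_midi(name: str) -> int:
    """Convert a note name such as 'F4' or 'C#5' to a MIDI number (C4 = 60 convention)."""
    letter, rest = name[0].upper(), name[1:]
    offset = _SEMITONES[letter]
    if rest.startswith("#"):      # sharp raises the pitch by one semitone
        offset, rest = offset + 1, rest[1:]
    elif rest.startswith("b"):    # flat lowers the pitch by one semitone
        offset, rest = offset - 1, rest[1:]
    return 12 * (int(rest) + 1) + offset


@dataclass(frozen=True)
class AlignedNote:
    """One lyric word with its three note attributes (hypothetical container)."""
    word: str        # lyric word sung on this note
    pitch: str       # note name, e.g. "F4"
    duration: float  # note length; seconds is an assumed unit
    rest: float      # silence after the note; seconds is an assumed unit


def format_song(notes: list[AlignedNote]) -> str:
    """Serialize aligned notes as '|word,<pitch>,<duration>,<rest>' units.

    The delimiter and angle-bracket layout are assumptions for illustration.
    """
    body = "".join(
        f"|{n.word},<{n.pitch}>,<{n.duration:g}>,<{n.rest:g}>" for n in notes
    )
    return body + "|"


if __name__ == "__main__":
    line = [
        AlignedNote("twinkle", "C4", 0.5, 0.0),
        AlignedNote("star", "G4", 1.0, 0.25),
    ]
    print(format_song(line))
    print(note_to_midi("F4"))
```

Wrapping each attribute in its own bracketed unit mirrors the idea of giving musical symbols dedicated vocabulary entries, so that a pitch such as `<F4>` is not split into the unrelated text tokens "F" and "4".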