Abstract: Generating music from text descriptions is user-friendly, since text is a relatively easy interface for user engagement. While some approaches use text to control music audio generation, editing musical elements in generated audio is challenging for users. In contrast, symbolic music is easy to edit, making it more accessible for users to manipulate specific musical elements. In this paper, we propose MuseCoco, which generates symbolic music from text descriptions, using musical attributes as a bridge to break the task down into two stages: text-to-attribute understanding and attribute-to-music generation. MuseCoco stands for Music Composition Copilot; it empowers musicians to generate music directly from given text descriptions, offering a significant improvement in efficiency compared to creating music entirely from scratch. The system has two main advantages. First, it is data efficient: in the attribute-to-music generation stage, the attributes can be directly extracted from music sequences, making the model training self-supervised; in the text-to-attribute understanding stage, the text is synthesized and refined by ChatGPT based on the defined attribute templates. Second, the system can achieve precise control with specific attributes in text descriptions and offers multiple control options through attribute-conditioned or text-conditioned approaches. MuseCoco outperforms baseline systems in terms of musicality, controllability, and overall score by at least 1.27, 1.08, and 1.32, respectively. In addition, objective control accuracy improves by about 20%. We have also developed a robust large-scale model with 1.2 billion parameters, showcasing exceptional controllability and musicality.
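
The self-supervised idea in the attribute-to-music stage can be illustrated with a minimal sketch: attribute values are computed from the music sequence itself and prepended as conditioning tokens, so no human labels are needed. Everything concrete below is an assumption made for illustration only and is not taken from MuseCoco: the `Note` type, the two attributes (`pitch_range`, `note_density`), the `<name=value>` token format, and the `note_<pitch>` music tokens are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Note:
    """Hypothetical note representation (not MuseCoco's actual format)."""
    pitch: int       # MIDI pitch number, 0-127
    start: float     # onset time in beats
    duration: float  # length in beats


def extract_attributes(notes: List[Note]) -> Dict[str, float]:
    """Derive illustrative attribute values directly from a note sequence."""
    pitches = [n.pitch for n in notes]
    end = max(n.start + n.duration for n in notes)  # total length in beats
    return {
        "pitch_range": max(pitches) - min(pitches),  # span in semitones
        "note_density": round(len(notes) / end, 2),  # notes per beat
    }


def to_prefix_tokens(attributes: Dict[str, float]) -> List[str]:
    """Render attributes as conditioning tokens (assumed token format)."""
    return [f"<{name}={value}>" for name, value in sorted(attributes.items())]


def make_training_example(notes: List[Note]) -> List[str]:
    """Build one training sequence: attribute prefix followed by music tokens.

    The attribute labels come from the notes themselves, which is what
    makes this kind of training self-supervised.
    """
    music_tokens = [f"note_{n.pitch}" for n in notes]  # toy tokenization
    return to_prefix_tokens(extract_attributes(notes)) + music_tokens


if __name__ == "__main__":
    melody = [Note(60, 0.0, 1.0), Note(64, 1.0, 1.0), Note(67, 2.0, 2.0)]
    print(make_training_example(melody))
```

At inference time, a pipeline of this shape would fill the same prefix either from attribute values supplied directly by the user or from values predicted from a text description, which corresponds to the attribute-conditioned and text-conditioned control options mentioned above.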

Generating Symbolic Music from Natural Language Prompts using an LLM-Enhanced Dataset

Text2midi: Generating Symbolic Music from Captions

Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings

SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation

MuseCoco: Generating Symbolic Music from Text

MusicScore: A Dataset for Music Score Modeling and Generation

MidiCaps: A large-scale MIDI dataset with text captions

MuPT: A Generative Symbolic Music Pretrained Transformer

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models

LP-MusicCaps: LLM-Based Pseudo Music Captioning

SymPAC: Scalable Symbolic Music Generation With Prompts And Constraints

MusicLM: Generating Music From Text

Flexible Control in Symbolic Music Generation via Musical Metadata

Learning to Generate Music With Sentiment

Retrieval Augmented Generation of Symbolic Music with LLMs

ChatMusician: Understanding and Generating Music Intrinsically with LLM

SymforNet: application of cross-modal information correspondences based on self-supervision in symbolic music generation

MeloTrans: A Text to Symbolic Music Generation Model Following Human Composition Habit