Abstract: Music scores are written representations of music and contain rich information about musical components. The visual information on music scores includes notes, rests, staff lines, clefs, dynamics, and articulations. This visual information in music scores contains more semantic information than audio and symbolic representations of music. Previous music score datasets have limited sizes and are mainly designed for optical music recognition (OMR). There is a lack of research on creating a large-scale benchmark dataset for music modeling and generation. In this work, we propose MusicScore, a large-scale music score dataset collected and processed from the International Music Score Library Project (IMSLP). MusicScore consists of image-text pairs, where the image is a page of a music score and the text is the metadata of the music. The metadata of MusicScore is extracted from the general information section of the IMSLP pages. The metadata includes rich information about the composer, instrument, piece style, and genre of the music pieces. MusicScore is curated into small, medium, and large scales of 400, 14k, and 200k image-text pairs with varying diversity, respectively. We build a score generation system based on a UNet diffusion model to generate visually readable music scores conditioned on text descriptions to benchmark the MusicScore dataset for music score generation. MusicScore is released to the public at https://huggingface.co/datasets/ZheqiDAI/MusicScore.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the lack of large-scale benchmark datasets for music score generation. Existing music score datasets are relatively small and designed primarily for Optical Music Recognition (OMR); no large-scale image-text pair dataset exists for music modeling and generation. The authors therefore propose **MusicScore**, a large-scale music score dataset collected and processed from the International Music Score Library Project (IMSLP).

### Main Contributions

1. **Proposing a Music Score Generation Dataset and Benchmark**: According to the authors, this is the first dataset and benchmark system proposed specifically for music score generation.
2. **Dataset Source**: Downloading, processing, and cleaning music scores and their corresponding metadata from IMSLP.
3. **Multi-Scale Dataset**: Dividing MusicScore into three versions: small-scale (400 image-text pairs), medium-scale (14,000 image-text pairs), and large-scale (200,000 image-text pairs), to meet different research needs.
4. **Rich Metadata**: Including metadata extracted from the general information section of IMSLP pages, such as composer, instrument, piece style, and genre.
5. **Text-Based Latent Diffusion Model**: Constructing a system based on a UNet diffusion model that generates visually readable music score images from text descriptions, used to benchmark the MusicScore dataset on music score generation.

### Background and Motivation

Music scores are the written representation of music and contain rich information about musical elements such as notes, rests, staff lines, clefs, dynamics, and articulations. Compared to audio and symbolic representations, the visual information in music scores carries richer semantic information. However, the lack of large-scale music score datasets and benchmarks currently limits progress in music score generation research.

### Methods

1. **Dataset Collection**: Downloading music score PDF files from IMSLP and extracting metadata for each piece.
2. **Dataset Processing**:
   - **Color Depth Filtering**: Retaining black-and-white images with 1-bit color depth and removing color images.
   - **Non-Score Page Filtering**: Training a classification model to exclude cover pages and text description pages, so that each single-page score image contains only musical content.
   - **Manually Annotated Subset**: Creating a small-scale dataset, MusicScore-400, containing 403 image-text pairs, for rapid development and testing of music score generation systems.
3. **Metadata Processing**: Extracting metadata from the general information section of IMSLP pages and storing it as JSON files.
4. **Generation System**: A text-driven latent diffusion model comprising a Variational Autoencoder (VAE), a text encoder, and a UNet backbone network, which generates readable music score images from text descriptions.

### Experiments and Evaluation

1. **Generation System**: Fine-tuning the Stable Diffusion model to generate music score images that match the input text descriptions.
2. **Performance Evaluation**: Using Fréchet Inception Distance (FID) to evaluate the quality of the generated music score images. The results are reported as good across the different scales of the dataset.

### Conclusion

The authors constructed the **MusicScore** dataset and developed a music score generation system based on a text-driven latent diffusion model. This work fills a gap in the field of music score generation, providing resources and benchmarks for related research. Planned future work includes developing a MusicScore-CLIP model and integrating score, audio, and symbolic representations into a unified music modeling and generation system.
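The filtering and metadata steps described under Methods could be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout, the field names (`bit_depth`, `is_score_page`, `composer`, `instrumentation`, `piece_style`), and the helper functions are hypothetical and are not taken from the MusicScore release, and the non-score classifier is represented by a precomputed boolean flag rather than a trained model.

```python
import json

# Hypothetical metadata record. The field names are assumptions made for
# illustration; they are not the dataset's actual JSON schema.
RAW_RECORD = """
{
  "image": "page_0001.png",
  "bit_depth": 1,
  "is_score_page": true,
  "composer": "Bach, Johann Sebastian",
  "instrumentation": "keyboard",
  "piece_style": "Baroque"
}
"""


def keep_page(record: dict) -> bool:
    """Keep only 1-bit pages that a (hypothetical) classifier flagged as score pages."""
    return record.get("bit_depth") == 1 and bool(record.get("is_score_page"))


def build_caption(record: dict) -> str:
    """Join the metadata fields that are present into one text description."""
    fields = ("composer", "instrumentation", "piece_style")
    parts = [f"{name}: {record[name]}" for name in fields if record.get(name)]
    return "; ".join(parts)


def to_pair(record: dict):
    """Return an (image path, caption) pair, or None if the page is filtered out."""
    if not keep_page(record):
        return None
    return record["image"], build_caption(record)


record = json.loads(RAW_RECORD)
pair = to_pair(record)
print(pair)
```

Keeping the filter and the caption builder as separate functions mirrors the separation between dataset processing and metadata processing in the summary; how the released dataset actually formats its text field may differ.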

MusicScore: A Dataset for Music Score Modeling and Generation
