MusicScore: A Dataset for Music Score Modeling and Generation

Yuheng Lin, Zheqi Dai, Qiuqiang Kong
2024-06-17
Abstract: Music scores are written representations of music and contain rich information about musical components. The visual information on music scores includes notes, rests, staff lines, clefs, dynamics, and articulations. This visual information in music scores contains more semantic information than audio and symbolic representations of music. Previous music score datasets have limited sizes and are mainly designed for optical music recognition (OMR). There is a lack of research on creating a large-scale benchmark dataset for music modeling and generation. In this work, we propose MusicScore, a large-scale music score dataset collected and processed from the International Music Score Library Project (IMSLP). MusicScore consists of image-text pairs, where the image is a page of a music score and the text is the metadata of the music. The metadata of MusicScore is extracted from the general information section of the IMSLP pages. The metadata includes rich information about the composer, instrument, piece style, and genre of the music pieces. MusicScore is curated into small, medium, and large scales of 400, 14k, and 200k image-text pairs with varying diversity, respectively. We build a score generation system based on a UNet diffusion model to generate visually readable music scores conditioned on text descriptions to benchmark the MusicScore dataset for music score generation. MusicScore is released to the public at https://huggingface.co/datasets/ZheqiDAI/MusicScore.
Multimedia, Graphics, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the lack of large-scale benchmark datasets for music score generation. Existing music score datasets are relatively small and primarily designed for Optical Music Recognition (OMR), and there is a lack of large-scale image-text pair datasets for music modeling and generation. The authors therefore propose **MusicScore**, a large-scale music score dataset collected and processed from the International Music Score Library Project (IMSLP).

### Main Contributions

1. **Music Score Generation Dataset and Benchmark**: This is the first dataset and benchmark system proposed specifically for music score generation.
2. **Dataset Source**: Music scores and their corresponding metadata are downloaded from IMSLP, then processed and cleaned.
3. **Multi-Scale Dataset**: MusicScore is divided into three versions to meet different research needs: small-scale (400 image-text pairs), medium-scale (14,000 image-text pairs), and large-scale (200,000 image-text pairs).
4. **Rich Metadata**: Each pair includes metadata extracted from the general information section of IMSLP pages, such as composer, instrument, piece style, and genre.
5. **Text-Conditioned Latent Diffusion Model**: A system based on a UNet diffusion model generates visually readable music score images from text descriptions and serves as a benchmark for the MusicScore dataset on music score generation.

### Background and Motivation

Music scores are the written representation of music and contain rich information about musical elements such as notes, rests, staff lines, clefs, dynamics, and articulations. Compared with audio and symbolic representations, the visual information in music scores carries richer semantic information. However, the lack of large-scale music score datasets and benchmarks currently limits progress in music score generation research.

### Methods

1. **Dataset Collection**: Music score PDF files are downloaded from IMSLP, and metadata is extracted for each piece.
2. **Dataset Processing**:
   - **Color Depth Filtering**: Black-and-white images with 1-bit color depth are retained; color images are removed.
   - **Non-Score Page Filtering**: A classification model is trained to exclude cover pages and text description pages, so that each single-page score image contains only musical content.
   - **Manually Annotated Subset**: A small-scale dataset, MusicScore-400, containing 403 image-text pairs, is created for rapid development and testing of music score generation systems.
3. **Metadata Processing**: Metadata is extracted from the general information section of IMSLP pages and stored as JSON files.
4. **Generation System**: A text-conditioned latent diffusion model, consisting of a Variational Autoencoder (VAE), a text encoder, and a UNet backbone, generates music score images from text descriptions.

### Experiments and Evaluation

1. **Generation System**: The Stable Diffusion model is fine-tuned to generate music score images that match the input text descriptions.
2. **Performance Evaluation**: Fréchet Inception Distance (FID) is used to evaluate the quality of the generated music score images. The reported results indicate good image quality across the different dataset scales.

### Conclusion

The authors constructed the **MusicScore** dataset and developed a music score generation system based on a text-conditioned latent diffusion model. This work fills a gap in music score generation by providing a dataset and benchmark for related research. Planned future work includes developing a MusicScore-CLIP model and integrating score, audio, and symbolic representations into a unified music modeling and generation system.
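
The filtering and pairing steps described under Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `ScorePage` record, the `is_score` flag (standing in for the output of the non-score page classifier), the metadata key names, and the example paths and values are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class ScorePage:
    """Hypothetical record for one page extracted from an IMSLP PDF."""
    image_path: str   # assumed path to the single-page image
    bit_depth: int    # color depth in bits per pixel
    is_score: bool    # assumed output of a score/non-score page classifier
    metadata: dict    # fields taken from the IMSLP general information section


def keep_page(page: ScorePage) -> bool:
    """Keep only 1-bit black-and-white pages that contain musical content."""
    return page.bit_depth == 1 and page.is_score


def build_caption(metadata: dict) -> str:
    """Join the available metadata fields into one text description."""
    keys = ["composer", "instrument", "piece_style", "genre"]  # assumed key names
    parts = [f"{key}: {metadata[key]}" for key in keys if metadata.get(key)]
    return "; ".join(parts)


def build_pairs(pages: list) -> list:
    """Turn the pages that pass the filters into image-text pairs."""
    return [
        {"image": page.image_path, "text": build_caption(page.metadata)}
        for page in pages
        if keep_page(page)
    ]


# Made-up example pages; none of these values come from the dataset.
example_meta = {"composer": "Example Composer", "instrument": "piano", "genre": "nocturne"}
pages = [
    ScorePage("scores/example_0001.png", 1, True, example_meta),   # kept
    ScorePage("scores/example_0002.png", 8, True, example_meta),   # dropped: not 1-bit
    ScorePage("scores/example_0000.png", 1, False, example_meta),  # dropped: cover page
]

pairs = build_pairs(pages)
print(json.dumps(pairs, indent=2))
```

Keeping the two filters in a single predicate makes it easy to add further checks later (for example a minimum resolution) without touching the pairing logic.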
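
For reference, the Fréchet Inception Distance mentioned in the evaluation is conventionally defined as below, where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception-network features computed on real and generated images respectively. This is the standard definition rather than a formula quoted from the paper, and the symbols are chosen here for illustration; the summary does not state which feature layer or sample sizes were used.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

A lower FID means the feature statistics of the generated score images are closer to those of the real score images.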