DISCO-10M: A Large-Scale Music Dataset

Luca A. Lanzendörfer, Florian Grötschla, Emil Funke, Roger Wattenhofer
2023-10-05
Abstract: Music datasets play a crucial role in advancing research in machine learning for music. However, existing music datasets suffer from limited size, accessibility, and lack of audio resources. To address these shortcomings, we present DISCO-10M, a novel and extensive music dataset that surpasses the largest previously available music dataset by an order of magnitude. To ensure high-quality data, we implement a multi-stage filtering process. This process incorporates similarities based on textual descriptions and audio embeddings. Moreover, we provide precomputed CLAP embeddings alongside DISCO-10M, facilitating direct application on various downstream tasks. These embeddings enable efficient exploration of machine learning applications on the provided data. With DISCO-10M, we aim to democratize and facilitate new research to help advance the development of novel machine learning models for music.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses three shortcomings of existing music datasets: limited scale, poor accessibility, and a lack of audio resources. Specifically:

1. **Limited Scale**: Existing music datasets are relatively small, which limits the diversity and representativeness of their music content and the range of research scenarios they can support.
2. **Poor Accessibility**: Many state-of-the-art audio and music models are trained on proprietary datasets that are not available to the broader research community.
3. **Lack of Audio Resources**: The scarcity of audio recordings is a major obstacle to building new music machine learning models.

To address these issues, the authors present **DISCO-10M**, a large-scale music dataset that is an order of magnitude larger than the largest previously available music dataset. A multi-stage filtering process that incorporates similarities based on textual descriptions and audio embeddings is used to ensure data quality, and precomputed CLAP embeddings are provided alongside the dataset so that it can be applied directly to downstream tasks. The goal of DISCO-10M is to democratize access to large, high-quality music data and thereby facilitate new research on machine learning models for music.
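
The similarity-based filtering mentioned above can be illustrated with a minimal sketch. This is not the authors' pipeline: the summary does not specify the stages, thresholds, or embedding dimensions, so the function names (`cosine_similarity`, `filter_by_similarity`), the threshold value, and the toy 3-dimensional vectors below are all assumptions chosen for illustration. The sketch only shows the general idea of keeping tracks whose audio embedding is close enough to a reference embedding (for example, a text prompt embedded in the same space).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(audio_embeddings, reference_embedding, threshold):
    """Return the IDs of tracks whose embedding is at least `threshold` similar
    to the reference. The threshold is a hypothetical tuning parameter."""
    kept = []
    for track_id, embedding in audio_embeddings.items():
        if cosine_similarity(embedding, reference_embedding) >= threshold:
            kept.append(track_id)
    return kept


# Toy stand-ins for real embeddings (assumed values, far smaller than real ones).
reference = [1.0, 0.0, 0.0]          # e.g. an embedded text description
tracks = {
    "track_a": [0.9, 0.1, 0.0],      # close to the reference
    "track_b": [0.0, 1.0, 0.0],      # orthogonal to the reference
    "track_c": [0.5, 0.5, 0.0],      # moderately close
}

kept = filter_by_similarity(tracks, reference, threshold=0.6)
print(kept)
```

With real data, the toy vectors would be replaced by the precomputed CLAP embeddings shipped with the dataset, and the threshold would need to be chosen empirically.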