Croissant: A Metadata Format for ML-Ready Datasets

Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Pierre Ruyssen, Rajat Shinde, Elena Simperl, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Jos van der Velde, Steffen Vogler, Carole-Jean Wu
DOI: https://doi.org/10.1145/3650203.3663326
2024-05-31
Abstract: Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, ready to be loaded into the most popular ML frameworks.
Machine Learning, Artificial Intelligence, Databases, Information Retrieval
What problem does this paper attempt to address?
This paper proposes Croissant, a metadata format that addresses key problems in data management for machine learning (ML). Data is a crucial resource in ML, but handling it is often time-consuming: formats vary, tools are incompatible with one another, and datasets are hard to discover and combine. The use of data to train and evaluate ML models has also raised responsible AI concerns, including licensing, privacy, and bias.

Croissant aims to improve the discoverability, portability, and interoperability of datasets so that they can be loaded directly into ML frameworks and tools. It describes a dataset's attributes, the resources it contains, and their structure and semantics, which simplifies use and sharing and promotes responsible AI practices. Croissant supports common data modalities such as images, text, and audio by providing a unified "view" over different data formats and layouts, and it lets users add semantic descriptions and ML-specific information.

The paper also describes the integration of Croissant with data repositories such as HuggingFace, Kaggle, and OpenML, along with open-source reference implementations that include loaders and editors. In addition, Croissant supports extensions for documenting Responsible AI information, which promotes data transparency and accountability. In summary, Croissant provides a standardized metadata format that simplifies ML data management and improves the discovery, sharing, and reuse of datasets, while taking the responsible use of data into account.
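To make the idea of describing a dataset's resources, structure, and semantics more concrete, the following is a minimal sketch of Croissant-style metadata built as JSON-LD in Python. It is an illustration under stated assumptions, not the authors' reference implementation: the helper names `build_metadata` and `column_names`, the dataset name, and the `example.org` URLs are hypothetical, and the vocabulary is deliberately simplified, so the output is not guaranteed to validate against the full Croissant specification.

```python
import json


def build_metadata(name, url, csv_url, columns):
    """Build a simplified, Croissant-style JSON-LD description of one CSV file.

    `columns` maps a column name to a data type such as "sc:Text".
    The keys used here follow the general shape of Croissant metadata,
    but this is an illustrative subset, not the complete vocabulary.
    """
    file_id = "data.csv"  # hypothetical identifier for the single resource
    return {
        "@context": {
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": name,
        "url": url,
        # Resources: the files that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Structure: a record set exposes a uniform "view" over the file.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "examples",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": f"examples/{col}",
                        "dataType": dtype,
                        "source": {
                            "fileObject": {"@id": file_id},
                            "extract": {"column": col},
                        },
                    }
                    for col, dtype in columns.items()
                ],
            }
        ],
    }


def column_names(metadata, record_set_id="examples"):
    """Return the source column of every field in the given record set."""
    for record_set in metadata["recordSet"]:
        if record_set["@id"] == record_set_id:
            return [f["source"]["extract"]["column"] for f in record_set["field"]]
    raise KeyError(record_set_id)


# Hypothetical dataset used purely for illustration.
metadata = build_metadata(
    name="toy-sentiment",
    url="https://example.org/toy-sentiment",
    csv_url="https://example.org/toy-sentiment/data.csv",
    columns={"text": "sc:Text", "label": "sc:Integer"},
)

print(json.dumps(metadata, indent=2))
```

A loader that understands this kind of description can locate the file, read the named columns, and hand typed records to an ML framework without dataset-specific parsing code, which is the portability benefit the summary describes.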