Croissant: A Metadata Format for ML-Ready Datasets

Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Pierre Ruyssen, Rajat Shinde, Elena Simperl, Geoffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Jos van der Velde, Steffen Vogler, Carole-Jean Wu

DOI: https://doi.org/10.1145/3650203.3663326

2024-05-31

Abstract: Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, ready to be loaded into the most popular ML frameworks.

Machine Learning, Artificial Intelligence, Databases, Information Retrieval

What problem does this paper attempt to address?

The paper proposes Croissant, a metadata format aimed at key pain points of data management in machine learning (ML). Data is a crucial resource for ML, but working with it is time-consuming: formats vary, tools are often incompatible, and datasets are hard to discover and combine. The use of data to train and evaluate ML models has also prompted discussion of responsible AI, including licensing, privacy, and bias.

Croissant aims to make datasets more discoverable, portable, and interoperable, so that they can be loaded directly into ML frameworks and tools. A Croissant description covers the dataset's attributes, the resources it contains, and the structure and semantics of those resources. It supports common data types such as images, text, and audio by providing a unified "view" over differing formats and layouts, and it lets users add semantic descriptions and ML-specific information.

The paper also describes Croissant's integration with data repositories such as Hugging Face, Kaggle, and OpenML, along with open-source reference implementations, including loaders and editors. In addition, Croissant supports extensions for Responsible AI documentation, which promote data transparency and accountability. In summary, Croissant offers a standardized metadata format that simplifies ML data management and makes datasets easier to discover, share, and reuse, while taking the responsible use of data into account.
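The three layers mentioned above (dataset attributes, the resources a dataset contains, and the structure laid over those resources) can be pictured with a small sketch. It is a hand-written, heavily simplified Croissant-style description held in a Python dictionary, not a complete or validated Croissant file. The dataset name, URL, and column names are hypothetical, and `summarize` and `dangling_sources` are illustrative helpers written for this example, not part of any Croissant library.

```python
import json

# Hand-written, heavily simplified Croissant-style description.
# Assumptions: the dataset name, URL, and columns are hypothetical, and a
# valid Croissant file needs more than this (for example an "@context").
metadata = {
    "@type": "sc:Dataset",
    "name": "toy_flowers",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # Resources: the files that make up the dataset.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "flowers.csv",
            "contentUrl": "https://example.org/flowers.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Structure: a tabular "view" over the resources, one field per column.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "default",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "default/petal_length",
                    "dataType": "sc:Float",
                    "source": {
                        "fileObject": {"@id": "flowers.csv"},
                        "extract": {"column": "petal_length"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "default/species",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "flowers.csv"},
                        "extract": {"column": "species"},
                    },
                },
            ],
        }
    ],
}


def summarize(meta):
    """Map each record set id to (field id, data type, source file) tuples."""
    summary = {}
    for record_set in meta.get("recordSet", []):
        rows = []
        for field in record_set.get("field", []):
            source = field.get("source", {})
            file_id = source.get("fileObject", {}).get("@id")
            rows.append((field["@id"], field.get("dataType"), file_id))
        summary[record_set["@id"]] = rows
    return summary


def dangling_sources(meta):
    """List field ids whose source file is not declared under distribution."""
    declared = {res["@id"] for res in meta.get("distribution", [])}
    missing = []
    for fields in summarize(meta).values():
        for field_id, _data_type, file_id in fields:
            if file_id not in declared:
                missing.append(field_id)
    return missing


print(json.dumps(summarize(metadata), indent=2))
print("fields with undeclared sources:", dangling_sources(metadata))
```

Because each field names the resource it is read from, a tool can map columns to files without knowing the dataset in advance. That is the kind of portability the format is meant to provide; a real loader would also fetch and parse the files.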
