Abstract: Data is central to the development and evaluation of machine learning (ML) models. However, the use of problematic or inappropriate datasets can result in harms when the resulting models are deployed. To encourage responsible AI practice through more deliberate reflection on datasets and transparency around the processes by which they are created, researchers and practitioners have begun to advocate for increased data documentation and have proposed several data documentation frameworks. However, there is little research on whether these data documentation frameworks meet the needs of ML practitioners, who both create and consume datasets. To address this gap, we set out to understand ML practitioners' data documentation perceptions, needs, challenges, and desiderata, with the ultimate goal of deriving design requirements that can inform future data documentation frameworks. We conducted a series of semi-structured interviews with 14 ML practitioners at a single large, international technology company. We had them answer a list of questions taken from datasheets for datasets~\cite{gebru2018datasheets}. Our findings show that current approaches to data documentation are largely ad hoc and myopic in nature. Participants expressed needs for data documentation frameworks to be adaptable to their contexts, integrated into their existing tools and workflows, and automated wherever possible. Although data documentation frameworks are often motivated from the perspective of responsible AI, participants did not make the connection between the questions that they were asked to answer and their responsible AI implications. In addition, participants often had difficulty prioritizing the needs of dataset consumers and providing information that someone unfamiliar with their datasets might need to know. Based on these findings, we derive seven design requirements for future data documentation frameworks, such as more actionable guidance on how the characteristics of datasets might result in harms and how these harms might be mitigated, more explicit prompts for reflection, automated adaptation to different contexts, and integration into ML practitioners' existing tools and workflows.

Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata

Improving governance outcomes through AI documentation: Bridging theory and practice