Abstract: Data curation is a field with origins in librarianship and archives, whose scholarship and thinking on data issues go back centuries, if not millennia. The field of machine learning is increasingly observing the importance of data curation to the advancement of both applications and fundamental understanding of machine learning models, evidenced not least by the creation of the Datasets and Benchmarks track itself. This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation. We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit developed through a literature review of data curation principles. We use the framework to assess the strengths and weaknesses in current dataset development practices of 60 datasets published in the NeurIPS Datasets and Benchmarks track from 2021-2023. We summarize key findings and trends. Results indicate greater need for documentation about environmental footprint, ethical considerations, and data management. We suggest targeted strategies and resources to improve documentation in these areas and provide recommendations for the NeurIPS peer-review process that prioritize rigorous data curation in ML. Finally, we provide results in the format of a dataset that showcases aspects of recommended data curation practices. Our rubric and results are of interest for improving data curation practices broadly in the field of ML as well as to data curation and science and technology studies scholars studying practices in ML. Our aim is to support continued improvement in interdisciplinary research on dataset practices, ultimately improving the reusability and reproducibility of new datasets and benchmarks, enabling standardized and informed human oversight, and strengthening the foundation of rigorous and responsible ML research.

What problem does this paper attempt to address?

This paper addresses deficiencies in how datasets are developed and managed in machine learning (ML), particularly in dataset documentation, ethical considerations, environmental impact, and data management. Specifically:

1. **Insufficient dataset documentation**: Many datasets lack sufficient documentation, resulting in poor reproducibility and reusability.
2. **Inadequate ethical considerations**: Ethical issues, such as privacy protection and informed consent, are often not fully considered in datasets.
3. **Unquantified environmental footprint**: Most datasets do not quantify the environmental impact of their creation.
4. **Improper data management**: Long-term management and maintenance of datasets is problematic, and effective strategies are lacking.

To address these problems, the authors propose a systematic evaluation framework for dataset documentation, aiming to improve dataset development practices at NeurIPS through the principles of data curation. The framework consists of a rubric and a toolkit for evaluating the quality of dataset documentation and suggesting improvements. By evaluating 60 datasets published in the NeurIPS Datasets and Benchmarks track, the authors found clear weaknesses in current dataset development practices, especially in ethical considerations, environmental footprint, and data management. Based on these findings, they suggest targeted strategies and resources to improve documentation in these areas and propose changes to the NeurIPS peer-review process to promote more rigorous data curation practices.

### Specific problem summary

- **Uneven documentation quality**: The quality of documentation varies greatly among datasets. Some datasets meet almost all of the minimum standards, while others fall far short of them.
- **Inadequate ethical considerations**: Many datasets lack in-depth discussion of ethics, especially in terms of context awareness and location statements.
- **Unquantified environmental impact**: No dataset quantifies its environmental footprint during the creation process.
- **Imperfect data management**: Although some datasets perform well in basic accessibility and usability, there is still room for improvement in long-term management and maintenance.

### Improvement suggestions

- **Strengthen ethical review**: Increase the focus on ethical issues and ensure that privacy, informed consent, and similar concerns are fully considered during dataset development.
- **Quantify environmental impact**: Require dataset developers to quantify the environmental footprint of the creation process.
- **Raise documentation standards**: Develop more detailed and stricter documentation guidelines to ensure the reproducibility and reusability of datasets.
- **Optimize the peer-review process**: NeurIPS should place more emphasis on the quality of dataset documentation and data management practices during peer review.

Through these measures, the authors hope to promote more standardized, transparent, and responsible dataset development practices in machine learning.
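To make the idea of a rubric-based assessment concrete, the following is a minimal Python sketch of how per-dataset results might be tallied to surface the weakest documentation areas. It is not the authors' rubric or toolkit: the element names are hypothetical labels taken from the four areas in the summary above, the pass/fail scoring is an assumed simplification, and the two example datasets are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical rubric elements, named after the four areas in the summary.
# These are NOT the paper's actual rubric criteria.
ELEMENTS = (
    "documentation",
    "ethical_considerations",
    "environmental_footprint",
    "data_management",
)


@dataclass
class Assessment:
    """One reviewer's pass/fail judgments for a single dataset (assumed format)."""

    dataset: str
    # Maps element name -> True if the assumed minimum standard is met.
    meets_minimum: dict[str, bool]


def coverage(assessment: Assessment) -> float:
    """Fraction of rubric elements whose minimum standard is met."""
    met = sum(assessment.meets_minimum.get(e, False) for e in ELEMENTS)
    return met / len(ELEMENTS)


def weakest_elements(assessments: list[Assessment]) -> list[str]:
    """Rubric elements ordered from least often met to most often met."""
    counts = {
        e: sum(a.meets_minimum.get(e, False) for a in assessments)
        for e in ELEMENTS
    }
    return sorted(ELEMENTS, key=lambda e: counts[e])


# Made-up example data, for illustration only.
toy_a = Assessment(
    "toy-a",
    {
        "documentation": True,
        "ethical_considerations": True,
        "environmental_footprint": False,
        "data_management": True,
    },
)
toy_b = Assessment(
    "toy-b",
    {
        "documentation": True,
        "ethical_considerations": False,
        "environmental_footprint": False,
        "data_management": False,
    },
)

if __name__ == "__main__":
    for a in (toy_a, toy_b):
        print(f"{a.dataset}: {coverage(a):.2f} of elements met")
    print("least-met element:", weakest_elements([toy_a, toy_b])[0])
```

A real rubric would likely distinguish more than pass and fail (for example, a minimum standard versus a standard of excellence) and would record reviewer notes alongside each judgment; the boolean form is used here only to keep the tally readable.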

The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track
