The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track

Eshta Bhardwaj, Harshit Gujral, Siyi Wu, Ciara Zogheib, Tegan Maharaj, Christoph Becker
2024-10-30
Abstract: Data curation is a field with origins in librarianship and archives, whose scholarship and thinking on data issues go back centuries, if not millennia. The field of machine learning is increasingly observing the importance of data curation to the advancement of both applications and fundamental understanding of machine learning models - evidenced not least by the creation of the Datasets and Benchmarks track itself. This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation. We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit developed through a literature review of data curation principles. We use the framework to assess the strengths and weaknesses in current dataset development practices of 60 datasets published in the NeurIPS Datasets and Benchmarks track from 2021-2023. We summarize key findings and trends. Results indicate greater need for documentation about environmental footprint, ethical considerations, and data management. We suggest targeted strategies and resources to improve documentation in these areas and provide recommendations for the NeurIPS peer-review process that prioritize rigorous data curation in ML. Finally, we provide results in the format of a dataset that showcases aspects of recommended data curation practices. Our rubric and results are of interest for improving data curation practices broadly in the field of ML as well as to data curation and science and technology studies scholars studying practices in ML. Our aim is to support continued improvement in interdisciplinary research on dataset practices, ultimately improving the reusability and reproducibility of new datasets and benchmarks, enabling standardized and informed human oversight, and strengthening the foundation of rigorous and responsible ML research.
Computers and Society
What problem does this paper attempt to address?
This paper addresses shortcomings in how datasets are developed and curated in machine learning (ML), particularly in dataset documentation, ethical considerations, environmental impact, and data management. Specifically:

1. **Insufficient dataset documentation**: Many datasets lack sufficient documentation, which limits their reproducibility and reusability.
2. **Inadequate ethical considerations**: Ethical issues such as privacy protection and informed consent are often not fully addressed.
3. **Unquantified environmental footprint**: Most datasets do not quantify the environmental impact of their creation.
4. **Improper data management**: Long-term management and maintenance of datasets is problematic, and effective strategies are lacking.

To address these problems, the authors propose a systematic evaluation framework for dataset documentation, grounded in data curation principles, with the aim of improving dataset development practices at NeurIPS. The framework consists of a rubric and a toolkit for assessing the quality of dataset documentation and suggesting improvements. After evaluating 60 datasets published in the NeurIPS Datasets and Benchmarks track, the authors found clear weaknesses in current dataset development practices, especially in ethical considerations, environmental footprint, and data management. Based on these findings, they suggest targeted strategies and resources for improving documentation in these areas and recommend changes to the NeurIPS peer-review process that promote more rigorous data curation.

### Specific problem summary

- **Uneven documentation quality**: Documentation quality varies greatly across datasets. Some datasets meet almost all of the minimum standards, while others fall far short.
- **Inadequate ethical considerations**: Many datasets lack in-depth discussion of ethics, especially regarding context awareness and positionality statements.
- **Unquantified environmental impact**: No dataset quantifies the environmental footprint of its creation.
- **Imperfect data management**: Although some datasets perform well on basic accessibility and usability, there is still room for improvement in long-term management and maintenance.

### Improvement suggestions

- **Strengthen ethical review**: Increase the focus on ethical issues and ensure that privacy, informed consent, and related concerns are fully considered during dataset development.
- **Quantify environmental impact**: Require dataset developers to quantify the environmental footprint of dataset creation.
- **Raise documentation standards**: Develop more detailed and stricter documentation guidelines to ensure the reproducibility and reusability of datasets.
- **Optimize the peer-review process**: NeurIPS should place more emphasis on the quality of dataset documentation and data management practices during peer review.

Through these measures, the authors hope to promote more standardized, transparent, and responsible dataset development practices in machine learning.
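As an illustration of what quantifying a dataset's environmental footprint could look like in practice, the following is a minimal back-of-the-envelope sketch, not the method used in the paper. It multiplies average hardware power draw by runtime, a data-center overhead factor (PUE), and a grid carbon intensity. The function name, parameters, and all example values are assumptions chosen for illustration; real reporting would use measured power and a region-specific carbon intensity.

```python
def estimate_co2e_kg(
    power_watts: float,
    hours: float,
    pue: float = 1.0,
    grid_kg_co2e_per_kwh: float = 0.4,
) -> float:
    """Rough CO2-equivalent estimate (kg) for a compute job.

    All inputs are assumptions supplied by the caller:
      power_watts           average power draw of the hardware
      hours                 wall-clock runtime of the job
      pue                   power usage effectiveness of the facility (>= 1.0)
      grid_kg_co2e_per_kwh  carbon intensity of the local electricity grid
    """
    if power_watts < 0 or hours < 0 or grid_kg_co2e_per_kwh < 0:
        raise ValueError("power, hours, and carbon intensity must be non-negative")
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")

    # Energy drawn from the grid, including facility overhead.
    energy_kwh = (power_watts / 1000.0) * hours * pue
    # Convert energy to emissions using the grid's carbon intensity.
    return energy_kwh * grid_kg_co2e_per_kwh


if __name__ == "__main__":
    # Hypothetical job: one 300 W GPU running for 100 hours in a facility
    # with PUE 1.5, on a grid emitting 0.4 kg CO2e per kWh.
    estimate = estimate_co2e_kg(300, 100, pue=1.5, grid_kg_co2e_per_kwh=0.4)
    print(f"Estimated footprint: {estimate:.1f} kg CO2e")
```

Even a coarse estimate like this, reported alongside the assumptions behind it, gives reviewers and dataset users something concrete to compare across datasets.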