Abstract: Data curation is the problem of how to collect and organize samples into a dataset that supports efficient learning. Despite the centrality of the task, little work has been devoted to a large-scale, systematic comparison of curation methods. In this work, we take steps towards a formal evaluation of data curation strategies and introduce SELECT, the first large-scale benchmark of curation strategies for image classification. To generate baseline methods for the SELECT benchmark, we create a new dataset, ImageNet++, which constitutes the largest superset of ImageNet-1K to date. Our dataset extends ImageNet with 5 new training-data shifts, each approximately the size of ImageNet-1K itself, and each assembled using a distinct curation strategy. We evaluate our data curation baselines in two ways: (i) using each training-data shift to train identical image classification models from scratch, and (ii) using the data itself to fit a pretrained self-supervised representation. Our findings show interesting trends, particularly pertaining to recent methods for data curation such as synthetic data generation and lookup based on CLIP embeddings. We show that although these strategies are highly competitive for certain tasks, the curation strategy used to assemble the original ImageNet-1K dataset remains the gold standard. We anticipate that our benchmark can illuminate the path for new methods to further reduce the gap. We release our checkpoints, code, documentation, and a link to our dataset at https://github.com/jimmyxu123/SELECT.

### What problem does this paper attempt to solve?

This paper addresses the problem of **data curation**: how to collect and organize samples so that they support efficient learning for image classification. Although data curation is central to machine learning, previous studies have rarely compared curation methods systematically and at scale. The paper fills this gap by introducing a large-scale benchmark named **SELECT**.

The main objectives of the paper are:

1. **Formally evaluate data curation strategies**: The authors take steps towards a formal evaluation framework for data curation strategies, in order to better understand the effects of different strategies.
2. **Create a new benchmark dataset**: To generate the baseline methods for the SELECT benchmark, the authors create a new dataset, **ImageNet++**, the largest superset of ImageNet-1K to date. It contains 5 training-data shifts, each built with a distinct curation strategy and each comparable in size to ImageNet-1K.
3. **Evaluate the effects of curation strategies**: The authors evaluate these data curation baselines in two ways:
   - Train identical image classification models from scratch on each training-data shift.
   - Use a fixed pre-trained self-supervised representation to examine the effects of these data shifts.

### Key findings of the paper

1. **Low-cost curation strategies fail to outperform ImageNet**: The low-cost data curation methods studied still perform worse than the original ImageNet dataset on most evaluation metrics.
2. **Embedding-search-based strategies perform best**: These strategies significantly outperform diffusion-guided curation methods on most benchmarks, even though it is difficult to obtain class-balanced data with them.
3. **Human-curated data are not always more useful**: For example, although the OI1000 dataset uses human annotations, it performs worse than LA1000(img2img) on most metrics.
4. **Bigger does not mean better**: LA1000(img2img) is one of the smallest datasets, yet it performs very well on many metrics.
5. **Image-to-image strategies are superior to text-to-image strategies**: Most metrics indicate that the img2img method is more effective than txt2img.

### Conclusion

By introducing the SELECT benchmark and the ImageNet++ dataset, the paper systematically evaluates multiple data curation strategies on image classification tasks. The results reveal the limitations of existing curation methods and point to directions for improving data curation techniques.

### Formula summary

Some of the key formulas and metrics involved in the paper are as follows:

- The definitions of **Long-tailedness** and **Left-skewedness** can be found in Appendix G.
- For the calculation of quality metrics such as **CLIPScore** and **CMMD Score**, refer to the relevant literature [18, 21].
- The **Pearson correlation coefficient** is used to evaluate the correlation between different metrics; for example, \( R:P,CC \) denotes the correlation between precision and the number of classes.

These formulas and metrics help to quantify the effects of different data curation strategies, providing a basis for choosing among them.
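As an illustration of the Pearson correlation coefficient mentioned in the formula summary, the following is a minimal sketch in plain Python. The function name `pearson_r` and the sample values are assumptions made for this example only; they are not taken from the paper or its code, and the numbers are not reported results.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-dataset values (NOT from the paper): one precision score
# and one class count for each of five imagined training-data shifts.
precision = [0.61, 0.58, 0.70, 0.66, 0.73]
class_count = [950, 900, 1000, 980, 1000]

r = pearson_r(precision, class_count)
print(f"R (precision vs. class count) = {r:.3f}")
```

A value near 1 would indicate that shifts covering more classes tend to score higher precision, a value near -1 the opposite, and a value near 0 no linear relationship. With only a handful of datasets, such a coefficient is a rough indicator rather than strong evidence.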

SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification

Image Classification with Small Datasets: Overview and Benchmark

CiT: Curation in Training for Effective Vision-Language Data

Deep Neural Network Benchmarks for Selective Classification

The Role of Data Curation in Image Captioning

Embrace Sustainable AI: Dynamic Data Subset Selection for Image Classification

CDTD: A Large-Scale Cross-Domain Benchmark for Instance-Level Image-to-Image Translation and Domain Adaptive Object Detection.

Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Tune It or Don't Use It: Benchmarking Data-Efficient Image Classification

The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track

From ImageNet to Image Classification: Contextualizing Progress on Benchmarks

From MNIST to ImageNet and Back: Benchmarking Continual Curriculum Learning

ImageNet Large Scale Visual Recognition Challenge

Evaluation and benchmark for biological image segmentation

Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking

Curating Training Data for Reliable Large-Scale Visual Data Analysis: Lessons from Identifying Trash in Street View Imagery

DCA-Bench: A Benchmark for Dataset Curation Agents

CurBench: Curriculum Learning Benchmark

DataComp: In search of the next generation of multimodal datasets

Demystifying CLIP Data