Abstract: High-quality human-annotated data is crucial for modern deep learning pipelines, yet the human annotation process is both costly and time-consuming. Given a constrained human labeling budget, selecting an informative and representative data subset for labeling can significantly reduce human annotation effort. Well-performing state-of-the-art (SOTA) coreset selection methods require ground-truth labels over the whole dataset, failing to reduce the human labeling burden. Meanwhile, SOTA label-free coreset selection methods deliver inferior performance due to poor geometry-based scores. In this paper, we introduce ELFS, a novel label-free coreset selection method. ELFS employs deep clustering to estimate data difficulty scores without ground-truth labels. Furthermore, ELFS uses a simple but effective double-end pruning method to mitigate bias on calculated scores, which further improves the performance on selected coresets. We evaluate ELFS on five vision benchmarks and show that ELFS consistently outperforms SOTA label-free baselines. For instance, at a 90% pruning rate, ELFS surpasses the best-performing baseline by 5.3% on CIFAR10 and 7.1% on CIFAR100. Moreover, ELFS even achieves comparable performance to supervised coreset selection at low pruning rates (e.g., 30% and 50%) on CIFAR10 and ImageNet-1K.

What problem does this paper attempt to address?

The problem this paper addresses is: given a limited human annotation budget, how to select an informative and representative subset of a large unlabeled dataset for labeling, so that annotation effort is reduced while model performance is preserved. Existing label-free coreset selection methods rely on geometry-based scores, which estimate data difficulty poorly and therefore select weaker subsets. Supervised coreset selection methods perform better, but they require ground-truth labels for the entire dataset, which defeats the purpose of reducing the annotation burden.

To solve this problem, the authors propose ELFS (Enhancing Label-Free Coreset Selection via Clustering-based Pseudo-Labeling), a new label-free coreset selection method. ELFS proceeds in the following steps:

1. **Pseudo-label Generation**: Use deep clustering to assign a pseudo-label to each data point.
2. **Training Dynamics Scoring**: Compute training dynamics scores (such as AUM and forgetting scores) from the pseudo-labels. These scores reflect the difficulty of each data point.
3. **Double-End Pruning**: Apply a simple double-end pruning method to mitigate bias in the calculated scores, which further improves the quality of the selected coreset.

Through these steps, ELFS selects high-quality coresets without using ground-truth labels. It consistently outperforms existing label-free coreset selection methods and, at low pruning rates (e.g., 30% and 50%) on CIFAR10 and ImageNet-1K, achieves performance comparable to supervised methods.
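The scoring and pruning steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the `hard_prune_frac` parameter, and the toy logits are all assumptions made for the example. AUM is computed in its usual form (the margin between the assigned-label logit and the largest other logit, averaged over epochs), and double-end pruning is read here as "drop a fraction of the hardest examples, then keep the hardest remaining ones until the budget is met", so that the easiest examples are discarded as well.

```python
def area_under_margin(logits_per_epoch, label):
    """AUM for one example: mean over epochs of
    (logit of the assigned label) - (largest logit among the other classes)."""
    margins = []
    for logits in logits_per_epoch:
        largest_other = max(v for c, v in enumerate(logits) if c != label)
        margins.append(logits[label] - largest_other)
    return sum(margins) / len(margins)


def double_end_prune(scores, budget, hard_prune_frac):
    """Return the indices of the `budget` examples to keep.

    Lower AUM means harder. The hardest `hard_prune_frac` of the data is
    dropped first (hypothetical parameter), then the `budget` hardest
    remaining examples are kept, which also discards the easiest examples.
    """
    hardest_first = sorted(range(len(scores)), key=lambda i: scores[i])
    n_hard = int(hard_prune_frac * len(scores))
    return hardest_first[n_hard:n_hard + budget]


# Toy data: logits[i][t] is the logit vector of example i at epoch t.
logits = [
    [[2.0, 0.0], [3.0, 0.0]],  # agrees with its pseudo-label (class 0)
    [[0.0, 1.0], [0.0, 2.0]],  # disagrees with its pseudo-label (class 0)
    [[0.5, 0.0], [1.0, 0.5]],  # agrees with class 0, but with a small margin
    [[0.0, 4.0], [0.0, 6.0]],  # very confidently class 1
]
pseudo_labels = [0, 0, 0, 1]  # stand-ins for deep-clustering assignments

scores = [area_under_margin(l, y) for l, y in zip(logits, pseudo_labels)]
kept = double_end_prune(scores, budget=2, hard_prune_frac=0.25)
print(scores)  # [2.5, -1.5, 0.5, 5.0]
print(kept)    # [2, 0]
```

In this toy run the example with a negative AUM (most likely a wrong pseudo-label) is pruned from the hard end, and the most confidently classified example is pruned from the easy end.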
### Formula Representation

The main formulas involved in the paper include:

- **Optimization Problem Formula**:
\[ S^* = \arg\min_{S \subset D:\, |S| = k} \mathbb{E}_{x, y \sim P}\left[ l(x, y; h_S) \right] \]
where \(P\) is the distribution of dataset \(D\), \(l\) is the loss function, and \(h_S\) is the model trained with the annotated subset \(S\).

- **TEMI Loss Function**:
\[ L_{\text{TEMI}}(x) = -\frac{1}{2H} \sum_{h=1}^{H} \sum_{x' \in N_x} w_h(x, x') \cdot \left( \text{pmi}_h(x, x') + \text{pmi}_h(x', x) \right) \]
where, dropping the index \(h\),
\[ \text{pmi}(x, x') = \log\left( \sum_{c=1}^{C} \frac{\left( q_s(c|x)\, q_t(c|x') \right)^{\beta}}{q_t(c)} \right), \qquad w(x, x') = \sum_{c=1}^{C} q_t(c|x)\, q_t(c|x'). \]
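To make the TEMI terms concrete, the following sketch evaluates \(\text{pmi}\), \(w\), and the contribution of a single pair \((x, x')\) to the loss for \(H = 1\), using plain Python lists as class-probability vectors. The function and argument names are hypothetical, and the per-class division by \(q_t(c)\) inside the sum is an assumption about how the formula above should be read.

```python
import math


def pmi(q_s_x, q_t_xp, q_t_marginal, beta):
    """log( sum_c (q_s(c|x) * q_t(c|x'))**beta / q_t(c) )."""
    return math.log(sum((a * b) ** beta / m
                        for a, b, m in zip(q_s_x, q_t_xp, q_t_marginal)))


def weight(q_t_x, q_t_xp):
    """w(x, x') = sum_c q_t(c|x) * q_t(c|x')."""
    return sum(a * b for a, b in zip(q_t_x, q_t_xp))


def temi_pair_term(q_s_x, q_t_x, q_s_xp, q_t_xp, q_t_marginal, beta):
    """Contribution of one neighbor x' to L_TEMI(x) with a single index h (H = 1):
    -1/2 * w(x, x') * (pmi(x, x') + pmi(x', x))."""
    w = weight(q_t_x, q_t_xp)
    symmetric_pmi = (pmi(q_s_x, q_t_xp, q_t_marginal, beta)
                     + pmi(q_s_xp, q_t_x, q_t_marginal, beta))
    return -0.5 * w * symmetric_pmi


marginal = [0.5, 0.5]        # q_t(c), assumed uniform for the toy example
confident_a = [0.9, 0.1]     # both outputs favor class 0
confident_b = [0.1, 0.9]     # output favors class 1

# Pair whose predictions agree, versus a pair whose predictions disagree.
agree = temi_pair_term(confident_a, confident_a, confident_a, confident_a,
                       marginal, beta=1.0)
disagree = temi_pair_term(confident_a, confident_a, confident_b, confident_b,
                          marginal, beta=1.0)
print(agree < disagree)  # True
```

Under these toy inputs, a pair of neighbors assigned to the same cluster yields a lower (negative) loss term than a pair assigned to different clusters, which is the behavior the loss rewards.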

ELFS: Enhancing Label-Free Coreset Selection via Clustering-based Pseudo-Labeling

Zero-Shot Coreset Selection: Efficient Pruning for Unlabeled Data

DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning

Learning to Label with Active Learning and Reinforcement Learning.

Multi-label feature selection with high-sparse personalized and low-redundancy shared common features

Speculative Coreset Selection for Task-Specific Fine-tuning

Hierarchical Equalization Loss for Long-Tailed Instance Segmentation

An Optimized Run-Length Based Algorithm for Sparse Remote Sensing Image Labeling

A Good Foundation is Worth Many Labels: Label-Efficient Panoptic Segmentation

Towards Sustainable Learning: Coresets for Data-efficient Deep Learning

Label Smarter, Not Harder: CleverLabel for Faster Annotation of Ambiguous Image Classification with Higher Quality

LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning

Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset Selection

Integrating Deep Metric Learning with Coreset for Active Learning in 3D Segmentation

Tackling Noisy Clients in Federated Learning with End-to-end Label Correction

CAFS: Class Adaptive Framework for Semi-Supervised Semantic Segmentation

Roll With the Punches: Expansion and Shrinkage of Soft Label Selection for Semi-supervised Fine-Grained Learning

Data : Labeler 1 : Labeler 2 : Labeler 3 : Figure

Continual Learning on a Diet: Learning from Sparsely Labeled Streams Under Constrained Computation

An Enhanced Span-based Decomposition Method for Few-Shot Sequence Labeling

Adaptive Model Scheduling for Resource-efficient Data Labeling