Abstract: Coreset selection seeks to choose a subset of crucial training samples for efficient learning. It has gained traction in deep learning, particularly as training datasets have grown. Sample selection hinges on two aspects: how well a sample represents the data, which drives performance, and how diverse the selected samples are, which helps avert overfitting. Existing methods typically measure both representation and diversity with similarity metrics such as the L2-norm. They handle representation well through distribution matching guided by similarities between features, gradients, or other information. However, their selection of diverse samples remains sub-optimal. This is because similarity metrics usually just aggregate per-dimension similarities, without accounting for differences in which dimensions contribute significantly to the final similarity. As a result, they fall short of adequately capturing diversity. To address this, we propose a feature-based diversity constraint that compels the chosen subset to exhibit maximum diversity. The key is a novel Contributing Dimension Structure (CDS) metric. Unlike similarity metrics that measure the overall similarity of high-dimensional features, the CDS metric considers not only the reduction of redundancy in feature dimensions but also the differences between dimensions that contribute significantly to the final similarity. We show that existing methods tend to favor samples with similar CDS, which reduces the variety of CDS types within the coreset and in turn hinders model performance. In response, we enhance five classical selection methods by integrating the CDS constraint. Experiments on three datasets demonstrate that the proposed method is generally effective in boosting existing methods.
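The limitation of aggregate similarity metrics described in the abstract can be illustrated with a minimal sketch. It is not the paper's CDS definition: the per-dimension threshold, the binary mask, and the names `l2_distance` and `contributing_mask` are assumptions made only for illustration. Two samples lie at the same L2 distance from a reference point, yet different dimensions account for that distance.

```python
import math

def l2_distance(x, center):
    # Aggregate metric: sums squared per-dimension differences into one number.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, center)))

def contributing_mask(x, center, threshold):
    # Hypothetical stand-in for a "contributing dimension" pattern:
    # 1 where a dimension's absolute difference exceeds the threshold, else 0.
    return tuple(int(abs(a - b) > threshold) for a, b in zip(x, center))

center = [0.0, 0.0, 0.0]
sample_a = [3.0, 0.0, 0.0]   # differs from the center only in dimension 0
sample_b = [0.0, 0.0, 3.0]   # differs from the center only in dimension 2

dist_a = l2_distance(sample_a, center)
dist_b = l2_distance(sample_b, center)
mask_a = contributing_mask(sample_a, center, threshold=1.0)
mask_b = contributing_mask(sample_b, center, threshold=1.0)

# The aggregate distances are identical, but the masks differ,
# so a selector that sees only the distance cannot tell the samples apart.
print(dist_a, dist_b)
print(mask_a, mask_b)
```

Under these assumptions, a selector guided only by the L2 distance treats the two samples as interchangeable, while one that also compares the masks can prefer subsets covering more distinct patterns.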

Efficient Coreset Selection with Cluster-Based Methods

Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning

GoodCore: Data-effective and Data-efficient Machine Learning Through Coreset Selection over Incomplete Data

Feature Distribution Matching by Optimal Transport for Effective and Robust Coreset Selection

A Novel Sequential Coreset Method For Gradient Descent Algorithms

TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data

Coreset Stochastic Variance-Reduced Gradient with Application to Optimal Margin Distribution Machine

Mind the Boundary: Coreset Selection Via Reconstructing the Decision Boundary

Coresets for Data-efficient Training of Machine Learning Models

Cost-sensitive Regression Learning on Small Dataset Through Intra-Cluster Product Favoured Feature Selection

DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning

Contributing Dimension Structure of Deep Feature for Coreset Selection

The Power of Few: Accelerating and Enhancing Data Reweighting with Coreset Selection

Gradient Coreset for Federated Learning

GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training

A Two-Phase Recall-and-Select Framework for Fast Model Selection

A Feature Selection Method Based on Feature Grouping and Genetic Algorithm

Towards Sustainable Learning: Coresets for Data-efficient Deep Learning

ELFS: Enhancing Label-Free Coreset Selection via Clustering-based Pseudo-Labeling

Probabilistic Bilevel Coreset Selection

A CLIP-Powered Framework for Robust and Generalizable Data Selection