Abstract: Coreset selection seeks to choose a subset of crucial training samples for efficient learning. It has gained traction in deep learning, particularly with the surge in training dataset sizes. Sample selection hinges on two main aspects: a sample's representation in enhancing performance and the role of sample diversity in averting overfitting. Existing methods typically measure both the representation and diversity of data based on similarity metrics, such as L2-norm. They have capably tackled representation via distribution matching guided by the similarities of features, gradients, or other information between data. However, the results of effectively diverse sample selection are mired in sub-optimality. This is because the similarity metrics usually simply aggregate dimension similarities without acknowledging disparities among the dimensions that significantly contribute to the final similarity. As a result, they fall short of adequately capturing diversity. To address this, we propose a feature-based diversity constraint, compelling the chosen subset to exhibit maximum diversity. Our key lies in the introduction of a novel Contributing Dimension Structure (CDS) metric. Different from similarity metrics that measure the overall similarity of high-dimensional features, our CDS metric considers not only the reduction of redundancy in feature dimensions, but also the difference between dimensions that contribute significantly to the final similarity. We reveal that existing methods tend to favor samples with similar CDS, leading to a reduced variety of CDS types within the coreset and subsequently hindering model performance. In response, we enhance the performance of five classical selection methods by integrating the CDS constraint. Our experiments on three datasets demonstrate the general effectiveness of the proposed method in boosting existing methods.
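
The abstract's point about aggregated similarity can be illustrated with a small sketch. This is not the paper's definition of CDS: the 4-dimensional features, the reference point `center`, the threshold `beta`, and the helper names are all assumptions made up for illustration. Two samples lie at exactly the same L2 distance from the reference point, yet the dimensions responsible for that distance differ.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: collapses all per-dimension gaps into one number."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contributing_dims(a, b, beta):
    """Binary pattern with 1 where the per-dimension gap exceeds beta.

    `beta` is a hypothetical threshold chosen for this example only.
    """
    return tuple(1 if abs(x - y) > beta else 0 for x, y in zip(a, b))


# Hypothetical reference point (for instance, a class center) and two samples.
center = [0.0, 0.0, 0.0, 0.0]
sample_a = [3.0, 0.0, 0.0, 0.0]
sample_b = [0.0, 0.0, 0.0, 3.0]

dist_a = l2_distance(sample_a, center)
dist_b = l2_distance(sample_b, center)
pattern_a = contributing_dims(sample_a, center, beta=1.0)
pattern_b = contributing_dims(sample_b, center, beta=1.0)

print(dist_a, dist_b)        # identical distances
print(pattern_a, pattern_b)  # different contributing dimensions
```

A selection rule that looks only at the distance cannot tell these two samples apart, whereas the per-dimension pattern can.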

What problem does this paper attempt to address?

The paper addresses coreset selection in deep learning: choosing a subset of training samples that supports efficient learning. Existing methods typically assess both the representativeness and the diversity of samples with similarity measures such as the L2 norm. These measures capture diversity poorly because they aggregate per-dimension similarities into a single value and ignore which dimensions contribute most to that value. The paper proposes a new metric, the Contributing Dimension Structure (CDS), together with a feature-based CDS diversity constraint. The CDS metric accounts for both the reduction of redundancy in feature dimensions and the differences between the dimensions that contribute significantly to the final similarity. The authors find that existing methods tend to select samples with similar CDS, which reduces the variety of CDS types in the coreset and hurts model performance. The CDS constraint counters this by pushing the selected subset toward as many different CDS as possible. Integrated into five classical selection methods, it improves their performance on three image classification datasets. In short, the paper targets how to capture both diversity and representativeness when selecting samples, so that models trained on a limited subset learn more effectively.
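
One simple way to picture a diversity constraint of this kind is sketched below. It is an assumption-laden illustration, not the procedure the paper uses to integrate CDS into each of the five methods: `scores` stands in for the ranking produced by some base selection method, `patterns` stands in for a precomputed binary CDS-like label per sample, and `select_with_pattern_diversity` is a hypothetical helper. The sketch cycles over pattern groups so that as many distinct patterns as possible enter the subset before any group is sampled twice.

```python
from collections import defaultdict


def select_with_pattern_diversity(scores, patterns, budget):
    """Pick up to `budget` indices while covering as many patterns as possible.

    Within each pattern group, higher-scoring samples are taken first.
    """
    groups = defaultdict(list)
    for idx, pattern in enumerate(patterns):
        groups[pattern].append(idx)

    # One queue per pattern, ordered by descending base score.
    queues = [
        sorted(members, key=lambda i: scores[i], reverse=True)
        for members in groups.values()
    ]
    # Visit groups in order of their best sample's score.
    queues.sort(key=lambda q: scores[q[0]], reverse=True)

    selected = []
    while len(selected) < budget and any(queues):
        for queue in queues:
            if queue and len(selected) < budget:
                selected.append(queue.pop(0))
    return selected


# Hypothetical base scores and patterns for five samples.
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
patterns = [(1, 0), (1, 0), (1, 0), (0, 1), (1, 1)]

# Plain top-k by score picks three samples that share one pattern.
top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
# The constrained variant covers three different patterns.
diverse = select_with_pattern_diversity(scores, patterns, budget=3)

print(top_k)    # [0, 1, 2]
print(diverse)  # [0, 3, 4]
```

The trade-off is visible in the toy data: the constrained subset gives up two high-scoring samples in exchange for covering all three patterns, which is the kind of balance between representativeness and diversity the summary describes.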

Contributing Dimension Structure of Deep Feature for Coreset Selection

DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning

Feature Distribution Matching by Optimal Transport for Effective and Robust Coreset Selection

Efficient Coreset Selection with Cluster-Based Methods

Selecting Features by their Resilience to the Curse of Dimensionality

Mind the Boundary: Coreset Selection Via Reconstructing the Decision Boundary

CONDEN-FI: Consistency and Diversity Learning-based Multi-View Unsupervised Feature and Instance Co-Selection

Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning

Selection of diverse features with a diverse regularization

Deep Diversity-Enhanced Feature Representation of Hyperspectral Images

Dual-Enhanced Coreset Selection with Class-wise Collaboration for Online Blurry Class Incremental Learning

Coordinating Discernibility and Independence Scores of Variables in a 2D Space for Efficient and Accurate Feature Selection

Sparse and Low-Redundant Subspace Learning-Based Dual-Graph Regularized Robust Feature Selection

Emphasizing Closeness and Diversity Simultaneously for Deep Face Representation

Supervised Feature Selection via Collaborative Neurodynamic Optimization

Unsupervised Feature Selection Via Diversity-induced Self-representation

UDSFS: Unsupervised Deep Sparse Feature Selection

Exploring Learning with Deep Heterogeneous Descriptor-based Sampling

Discrimination Structure Complementarity-Based Feature Selection

Clustering-based feature subset selection with analysis on the redundancy–complementarity dimension

The Power of Few: Accelerating and Enhancing Data Reweighting with Coreset Selection