Towards Accelerated Model Training via Bayesian Data Selection

Zhijie Deng,Peng Cui,Jun Zhu

2023-11-07

Abstract: Mislabeled, duplicated, or biased data in real-world scenarios can lead to prolonged training and even hinder model convergence. Traditional solutions prioritizing easy or hard samples lack the flexibility to handle such a variety simultaneously. Recent work has proposed a more reasonable data selection principle by examining the data's impact on the model's generalization loss. However, its practical adoption relies on less principled approximations and additional holdout data. This work solves these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models. The resulting algorithm is efficient and easy to implement. We perform extensive empirical studies on challenging benchmarks with considerable data noise and imbalance in the online batch selection scenario, and observe superior training efficiency over competitive baselines. Notably, on the challenging WebVision benchmark, our method can achieve similar predictive performance with significantly fewer training iterations than leading data selection methods.

Machine Learning

What problem does this paper attempt to address?

The paper addresses mislabeled, duplicated, or biased data in real-world scenarios, which can prolong training and hinder model convergence. Traditional approaches that prioritize easy or hard samples lack the flexibility to handle all of these problems at once, and recent methods that select data by its impact on the model's generalization loss rely on less principled approximations and additional holdout data. The authors propose a method that leverages a lightweight Bayesian treatment and incorporates off-the-shelf zero-shot predictors built on large-scale pre-trained models. This approach provides a more reasonable approximation of the generalization-loss-based data selection principle without needing extra holdout data.

The key contributions and objectives of the paper can be summarized as follows:

1. **Objective**: Develop an efficient and easy-to-implement algorithm for data selection that accelerates model training and improves convergence, particularly in the presence of noisy and imbalanced data.
2. **Methodology**:
   - **Lower Bound Derivation**: Derive a lower bound for the objective function that separates the posterior predictive distributions defined on training and holdout data.
   - **Zero-shot Predictors**: Utilize off-the-shelf zero-shot predictors built on large-scale pre-trained models as proxies for the holdout data, eliminating the need for extra holdout data.
   - **Bayesian Treatment**: Maintain a Bayesian treatment of the training model using Laplace approximation and K
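The online batch selection idea summarized above can be illustrated with a minimal sketch. It is not the paper's algorithm: it omits the lower bound and the Laplace approximation, and instead uses a simple reducible-loss-style score (loss under the current model minus loss under a proxy predictor) as an assumed stand-in for the generalization-loss-based principle. The function names `cross_entropy` and `select_batch`, and the toy probabilities, are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of `label` under a class-probability vector."""
    return -math.log(max(probs[label], 1e-12))


def select_batch(train_probs, proxy_probs, labels, k):
    """Pick the k candidates with the largest assumed selection score.

    score = loss under the current training model
            - loss under a proxy predictor (e.g. a zero-shot classifier).
    A high score means the model has not learned the sample yet, while the
    proxy finds its label plausible. This scoring rule is an illustrative
    assumption, not the objective derived in the paper.
    """
    scores = [
        cross_entropy(p_train, y) - cross_entropy(p_proxy, y)
        for p_train, p_proxy, y in zip(train_probs, proxy_probs, labels)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Toy candidate batch with two classes (all numbers are made up).
train_probs = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.5, 0.5]]    # current model
proxy_probs = [[0.95, 0.05], [0.1, 0.9], [0.9, 0.1], [0.05, 0.95]]  # proxy predictor
labels = [0, 0, 0, 1]

selected, scores = select_batch(train_probs, proxy_probs, labels, k=2)
print(selected)
```

In this toy batch, sample 0 is already fit by the model and scores near zero, while sample 1 carries a label that the proxy also finds implausible, so its score is negative and it is skipped. Samples 2 and 3 are not yet learned but look correctly labeled to the proxy, so they are the ones selected for the gradient step.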
