Abstract: Training on web-scale data can take months. But most computation and time is wasted on redundant and noisy points that are already learnt or not learnable. To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique which selects approximately those points for training that most reduce the model's generalization loss. As a result, RHO-LOSS mitigates the weaknesses of existing data selection methods: techniques from the optimization literature typically select 'hard' (e.g. high loss) points, but such points are often noisy (not learnable) or less task-relevant. Conversely, curriculum learning prioritizes 'easy' points, but such points need not be trained on once learned. In contrast, RHO-LOSS selects points that are learnable, worth learning, and not yet learnt. RHO-LOSS trains in far fewer steps than prior art, improves accuracy, and speeds up training on a wide range of datasets, hyperparameters, and architectures (MLPs, CNNs, and BERT). On the large web-scraped image dataset Clothing-1M, RHO-LOSS trains in 18x fewer steps and reaches 2% higher final accuracy than uniform data shuffling.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses the inefficiency of training on web-scale data. The authors point out that when a model is trained on vast amounts of web-scraped data, most computation and time is wasted on redundant or noisy points: points the model has already learned, or points that are not learnable. As a result, training is slow and inefficient.

To address this, the paper introduces a data selection technique, **Reducible Holdout Loss Selection (RHO-LOSS)**. The core idea is to select for training approximately those points that most reduce the model's generalization loss. In doing so, RHO-LOSS avoids already-learned, noisy, and less task-relevant points, which accelerates training and improves final model performance.

### Main contributions

1. **Propose the RHO-LOSS technique**: a selection function based on probabilistic modeling that estimates how much training on each data point would reduce the generalization loss.
2. **Reduce the impact of redundant and noisy data**: RHO-LOSS avoids points that are already learned or noisy and prioritizes points that are still worth learning.
3. **Accelerate training**: experiments show that RHO-LOSS significantly reduces the number of training steps and improves final accuracy across multiple datasets, architectures, and hyperparameter settings. For example, on the Clothing-1M dataset, RHO-LOSS trains in 18x fewer steps than uniform data shuffling and reaches 2% higher final accuracy.

### Formula representation

The selection objective of RHO-LOSS can be expressed as:

\[ \arg\max_{(x,y) \in B_t} \left[ L[y \mid x; D_t] - L[y \mid x; D_{ho}] \right] \]

where:

- \( L[y \mid x; D_t] \) is the training loss of the current model on the data point \((x, y)\).
- \( L[y \mid x; D_{ho}] \) is the irreducible loss (IL): the loss on the data point \((x, y)\) of a model trained on the holdout set \( D_{ho} \).

The difference between the two terms is large only for points that the current model still gets wrong but that a model trained on the holdout set handles well. In this way, RHO-LOSS prioritizes points that are valuable and not yet learned, improving training efficiency and model performance.
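The selection objective above can be illustrated with a minimal Python sketch. It assumes the per-point training losses and irreducible losses for a candidate batch have already been computed elsewhere. The function names `reducible_holdout_loss` and `select_top_k`, the parameter `k`, and the example numbers are hypothetical and chosen only for illustration; they are not taken from the paper's code.

```python
from typing import List, Sequence


def reducible_holdout_loss(
    train_losses: Sequence[float],
    irreducible_losses: Sequence[float],
) -> List[float]:
    """Per-point score: training loss minus irreducible (holdout) loss."""
    if len(train_losses) != len(irreducible_losses):
        raise ValueError("both loss sequences must have the same length")
    return [t - i for t, i in zip(train_losses, irreducible_losses)]


def select_top_k(
    train_losses: Sequence[float],
    irreducible_losses: Sequence[float],
    k: int,
) -> List[int]:
    """Return the indices of the k points with the highest score."""
    scores = reducible_holdout_loss(train_losses, irreducible_losses)
    # Sort indices by score, highest first (ties keep their original order).
    ranked = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    return ranked[:k]


# Hypothetical candidate batch of four points (made-up numbers):
#   point 0: high training loss, low irreducible loss  -> learnable, not yet learnt
#   point 1: low training loss                         -> already learnt
#   point 2: high training loss, high irreducible loss -> likely noisy
#   point 3: moderate gap between the two losses
train = [2.5, 0.1, 3.0, 1.5]
irreducible = [0.3, 0.05, 2.9, 0.5]

print(select_top_k(train, irreducible, k=2))  # [0, 3]
```

Note that point 2 has the highest training loss in this batch, so a "pick the hardest points" rule would select it first, whereas subtracting the irreducible loss ranks it near the bottom.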

Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt
