What problem does this paper attempt to address?

This paper addresses two problems in training language models (LMs). First, training relies on large-scale unannotated natural language processing (NLP) datasets, so it consumes substantial computing resources, takes a long time, and is inefficient. Second, these datasets may contain low-quality or harmful data instances, which degrade the model's training results and reliability.

To address these problems, the paper proposes a novel method for numerically evaluating text quality in large-scale unannotated NLP datasets, assigning a "quality score" to each text instance. Low-quality text instances can then be identified and removed, improving the efficiency of LM training. Specifically, the method aims to:

1. **Establish a framework** for quantitatively evaluating text quality that is model-independent, so quality indicators do not have to be recalculated for each model.
2. **Optimize the dataset** by removing low-quality text instances, reducing the amount of data required for training while increasing training speed and accuracy.
3. **Improve model performance** by maintaining or even improving performance on downstream tasks while reducing the amount of data and the training time.

Experiments show significant effects across multiple models and datasets. For example, on the OpenWebText dataset, training with 40% less data was 42% faster and yielded an average absolute accuracy improvement of 0.9%; on the Wikipedia dataset, training with 20% less data was 21% faster and yielded an average absolute accuracy improvement of 0.8%.
### Formula Summary

- **Weight Calculation Formula**:
  \[ w_i=\max\left(0, \frac{PPL_{all}-PPL_i}{PPL_{all}}\right) \]
  where \(w_i\) is the weight of the \(i\)-th filter, \(PPL_i\) is the perplexity of the subset after applying the \(i\)-th filter, and \(PPL_{all}\) is the perplexity of the unfiltered dataset. A filter whose subset does not have lower perplexity than the unfiltered dataset therefore receives a weight of zero.
- **Line-level Quality Score Formula**:
  \[ score_{line}=\frac{\sum_{i = 1}^{F}w_iI_i(line)}{\sum_{i = 1}^{F}w_i} \]
  where \(score_{line}\) is the quality score of a line, \(w_i\) is the weight of the \(i\)-th filter, \(I_i(line)\) is an indicator function indicating whether the line meets the criteria of the \(i\)-th filter, and \(F\) is the number of filters.
- **Document-level Quality Score Formula**:
  \[ score_{doc}=\frac{\sum_{line = 1}^{n}tc_{line}\cdot score_{line}}{\sum_{line = 1}^{n}tc_{line}} \]
  where \(score_{doc}\) is the quality score of a document, \(tc_{line}\) is the number of tokens in a line, \(score_{line}\) is the quality score of that line, and \(n\) is the total number of lines in the document.

With these formulas, the paper provides an efficient and effective framework for improving the training process of large-scale language models.
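The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the three example filters, the whitespace tokenizer, the zero-division fallbacks, and the perplexity values are all assumptions made for the example. In practice the perplexities would come from evaluating a language model on each filtered subset.

```python
from typing import Callable, Sequence

# Hypothetical type: a filter returns True when a line meets its criterion.
LineFilter = Callable[[str], bool]


def filter_weights(ppl_all: float, ppl_filtered: Sequence[float]) -> list[float]:
    """w_i = max(0, (PPL_all - PPL_i) / PPL_all) for each filter i."""
    return [max(0.0, (ppl_all - ppl_i) / ppl_all) for ppl_i in ppl_filtered]


def line_score(line: str, filters: Sequence[LineFilter],
               weights: Sequence[float]) -> float:
    """Weighted fraction of filters that the line satisfies."""
    total = sum(weights)
    if total == 0:
        return 0.0  # assumption: no informative filter means score 0
    return sum(w for f, w in zip(filters, weights) if f(line)) / total


def doc_score(lines: Sequence[str], filters: Sequence[LineFilter],
              weights: Sequence[float],
              count_tokens: Callable[[str], int] = lambda s: len(s.split())) -> float:
    """Token-count-weighted average of the line scores."""
    counts = [count_tokens(line) for line in lines]
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: an empty document scores 0
    weighted = sum(tc * line_score(line, filters, weights)
                   for line, tc in zip(lines, counts))
    return weighted / total


# Illustrative filters only; the paper's actual filter set is not reproduced here.
example_filters: list[LineFilter] = [
    lambda s: any(c.isalpha() for c in s),       # contains a letter
    lambda s: len(s.split()) >= 3,               # at least three words
    lambda s: s.rstrip().endswith((".", "!", "?")),  # ends with punctuation
]

# Made-up perplexities: unfiltered = 100; after each filter = 80, 120, 90.
weights = filter_weights(100.0, [80.0, 120.0, 90.0])
document = ["The cat sat on the mat.", "12345"]
score = doc_score(document, example_filters, weights)
```

Here the second filter raises perplexity, so its weight is clipped to zero and it has no influence on the scores. The first line passes every filter and the second passes none, so the document score reduces to the share of tokens in the first line, 6 of 7.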

Text Quality-Based Pruning for Efficient Training of Language Models

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models

Large Language Model Pruning

Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

Neural Language Model Pruning for Automatic Speech Recognition

Improving Language Model Size Reduction Using Better Pruning Criteria

Gradient-based Intra-attention Pruning on Pre-trained Language Models

Language Model-Driven Data Pruning Enables Efficient Active Learning

Self-Data Distillation for Recovering Quality in Pruned Large Language Models

NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models

Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction

A Simple and Effective Pruning Approach for Large Language Models

ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

SparseLLM: Towards Global Pruning for Pre-trained Language Models

Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining

Pruning Foundation Models for High Accuracy without Retraining

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models