Vasu Sharma,Karthik Padthe,Newsha Ardalani,Kushal Tirumala,Russell Howes,Hu Xu,Po-Yao Huang,Shang-Wen Li,Armen Aghajanyan,Gargi Ghosh,Luke Zettlemoyer
Abstract: In recent times, training Language Models (LMs) has relied on computationally heavy training over massive datasets, which makes the training process extremely laborious. In this paper we propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets in a model-agnostic manner to assign the text instances a "quality score".
What problem does this paper attempt to address?
The paper addresses two related problems. First, training language models (LMs) relies on large-scale unannotated natural language processing (NLP) datasets, so the training process consumes substantial computing resources and time. Second, these datasets may contain low-quality or harmful instances, which degrade the performance and reliability of the trained model.
To address these problems, the paper proposes a method for numerically evaluating text quality in large-scale unannotated NLP datasets, assigning a "quality score" to each text instance. Low-quality instances can then be identified and removed, which improves the efficiency of LM training. Specifically, the method aims to:
1. **Establish a framework** for quantitatively evaluating text quality that is model-independent, so that quality indicators need not be recomputed for each model.
2. **Optimize the dataset** by removing low-quality text instances, reducing the amount of data required for training while making training faster and more accurate.
3. **Improve model performance**, maintaining or even improving results on downstream tasks while using less data and less training time.
Experiments show gains across multiple models and datasets. For example, on the OpenWebText dataset, training on 40% less data was 42% faster and improved average absolute accuracy by 0.9%; on the Wikipedia dataset, training on 20% less data was 21% faster and improved average absolute accuracy by 0.8%.
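As a rough illustration of the pruning step, the sketch below drops the lowest-scoring fraction of documents once each has a quality score. It assumes pruning means discarding a fixed fraction ranked by score; the function name `prune_lowest_scoring` and the `(text, score)` input layout are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def prune_lowest_scoring(docs: List[Tuple[str, float]], drop_fraction: float) -> List[str]:
    """Drop the lowest-scoring `drop_fraction` of documents.

    `docs` holds (text, quality_score) pairs. Surviving texts are returned
    in their original order. Hypothetical helper for illustration only.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    n_drop = int(len(docs) * drop_fraction)
    # Rank document indices by ascending score; the first n_drop are discarded.
    ranked = sorted(range(len(docs)), key=lambda i: docs[i][1])
    dropped = set(ranked[:n_drop])
    return [text for i, (text, _) in enumerate(docs) if i not in dropped]


# Toy corpus with made-up scores, pruned by 40%.
corpus = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.3), ("e", 0.7)]
kept = prune_lowest_scoring(corpus, 0.4)
```

With five documents and a 40% drop fraction, the two lowest-scoring documents are removed and the other three are kept.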
### Formula Summary
- **Weight Calculation Formula**:
\[
w_i=\max\left(0, \frac{PPL_{all}-PPL_i}{PPL_{all}}\right)
\]
where \(w_i\) is the weight of the \(i\)-th filter, \(PPL_i\) is the perplexity of the subset after applying the \(i\)-th filter, and \(PPL_{all}\) is the perplexity of the unfiltered dataset.
- **Line-level Quality Score Formula**:
\[
score_{line}=\frac{\sum_{i=1}^{F}w_i I_i(line)}{\sum_{i=1}^{F}w_i}
\]
where \(score_{line}\) is the quality score of a given line, \(w_i\) is the weight of the \(i\)-th filter, \(I_i(line)\) is an indicator function that equals 1 if the line meets the criteria of the \(i\)-th filter and 0 otherwise, and \(F\) is the number of filters.
- **Document-level Quality Score Formula**:
\[
score_{doc}=\frac{\sum_{line=1}^{n}tc_{line}\cdot score_{line}}{\sum_{line=1}^{n}tc_{line}}
\]
where \(score_{doc}\) is the quality score of a document, \(tc_{line}\) is the number of tokens in a given line, \(score_{line}\) is the quality score of that line, and \(n\) is the total number of lines in the document.
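The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the perplexity values are made up, the two example filters (`ends_with_punctuation`, `has_min_words`) are hypothetical stand-ins, tokens are counted by whitespace splitting, and returning 0.0 when all weights or token counts are zero is an assumption the formulas do not cover.

```python
from typing import Callable, List, Sequence

LineFilter = Callable[[str], bool]


def filter_weights(ppl_all: float, ppl_filtered: Sequence[float]) -> List[float]:
    """w_i = max(0, (PPL_all - PPL_i) / PPL_all) for each filter i."""
    return [max(0.0, (ppl_all - ppl_i) / ppl_all) for ppl_i in ppl_filtered]


def line_score(line: str, filters: Sequence[LineFilter], weights: Sequence[float]) -> float:
    """Weighted share of filters that the line satisfies."""
    total = sum(weights)
    if total == 0:
        return 0.0  # assumption: undefined case mapped to 0
    return sum(w for f, w in zip(filters, weights) if f(line)) / total


def doc_score(doc: str, filters: Sequence[LineFilter], weights: Sequence[float]) -> float:
    """Token-count-weighted average of the per-line scores."""
    lines = [ln for ln in doc.splitlines() if ln.strip()]
    counts = [len(ln.split()) for ln in lines]  # assumption: whitespace tokens
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: empty document scores 0
    return sum(c * line_score(ln, filters, weights) for c, ln in zip(counts, lines)) / total


# Hypothetical filters, for illustration only.
def ends_with_punctuation(line: str) -> bool:
    return line.rstrip().endswith((".", "!", "?"))


def has_min_words(line: str) -> bool:
    return len(line.split()) >= 3


filters = [ends_with_punctuation, has_min_words]
# Made-up perplexities: unfiltered 100; after each filter 75 and 50.
weights = filter_weights(100.0, [75.0, 50.0])
score = doc_score("Hello world.\nThis is a longer sentence.", filters, weights)
```

A filter whose subset has higher perplexity than the unfiltered data gets weight 0 and therefore has no influence on any score, which is what the `max(0, ...)` term in the weight formula encodes.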
Through these methods and formulas, the paper provides an efficient and effective framework for improving the training process of large-scale language models.