Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

Hsun-Yu Kuo, Yin-Hsiang Liao, Yu-Chieh Chao, Wei-Yun Ma, Pu-Jen Cheng
2024-10-29
Abstract: Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from the real-world data, and this misalignment can bring deficient outcomes while applying the trained model to applications. Therefore, we proposed efficient weighted-loss approaches to align synthetic data with real-world distribution by emphasizing high-quality and diversified data generated by LLMs with using merely a little real-world data. We empirically assessed the effectiveness of our method on multiple text classification tasks, and the results showed leveraging our approaches on a BERT-level model robustly outperformed standard cross-entropy and other data weighting approaches, providing potential solutions to effectively leveraging synthetic data from any suitable data generator for model training.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper addresses the mismatch between synthetic data generated by large language models (LLMs) and real-world data in text classification. Synthetic data can augment the training set and improve downstream performance, especially when real-world data is scarce, but it may deviate from the real-world distribution, so a model trained on it can perform poorly in practical applications. The paper therefore proposes efficient weighted-loss methods that align the synthetic data with the real-world distribution by emphasizing high-quality and diverse LLM-generated data, using only a small amount of real-world data.

### Main problem summary:
1. **Data deviation**: The data generated by LLMs may deviate from real-world data, which hurts the model's performance in practical applications.
2. **Data utilization efficiency**: Traditional data filtering strategies discard data that may still be valuable, whereas data weighting makes full use of all data by assigning a different weight to each data point.
3. **Model performance improvement**: How to use an effective data weighting method so that a model trained with synthetic data matches or even exceeds the performance of training with a small amount of real-world data.

### Solution:
The paper proposes two new weighted-loss methods:
- **Importance Loss (IMP-Loss)**: Computes an importance weight for each data point so that the weighted synthetic data more closely matches the real-world distribution.
- **Dynamic Importance Loss (DIMP-Loss)**: Dynamically adjusts the weight of each data point to further improve the model's adaptation to real-world data.

### Experimental verification:
The paper verifies the effectiveness of these two methods through experiments on multiple text classification tasks. The results show that models trained with IMP-Loss and DIMP-Loss outperform standard cross-entropy loss and other data weighting methods on multiple benchmarks, especially when a large amount of synthetic data is combined with a small amount of real-world data.

### Conclusion:
By proposing IMP-Loss and DIMP-Loss, the paper provides an effective way to improve model performance in practical applications when training on LLM-generated synthetic data, while reducing the dependence on large amounts of real-world data. This offers a new approach to the problem of data scarcity.
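
The per-example weighting described under Solution can be illustrated with a minimal sketch of a weighted cross-entropy loss. This is not the paper's implementation: the summary does not state how the weights are computed, so `ratio_weights` is a hypothetical density-ratio-style choice (the probability of the label under a model fitted to real data, divided by its probability under a model fitted to synthetic data), and all function and parameter names here are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, labels, weights):
    """Batch mean of w_i * (-log p(y_i | x_i)).

    With every weight equal to 1.0 this reduces to standard cross-entropy.
    """
    losses = []
    for logits, y, w in zip(batch_logits, labels, weights):
        p = softmax(logits)[y]
        losses.append(w * (-math.log(p)))
    return sum(losses) / len(losses)


def ratio_weights(real_side_probs, synthetic_side_probs, eps=1e-8):
    """Hypothetical weight (an assumption, not taken from the summary):
    p_real(y | x) / p_synthetic(y | x), with eps guarding against division by zero.
    """
    return [p / max(q, eps) for p, q in zip(real_side_probs, synthetic_side_probs)]


# Two synthetic examples: the first looks plausible under the real-data model,
# the second does not, so the second contributes less to the loss.
weights = ratio_weights([0.9, 0.2], [0.3, 0.8])
loss = weighted_cross_entropy([[2.0, 0.5], [0.1, 1.2]], [0, 1], weights)
```

Unlike filtering, no synthetic example is dropped in this sketch; a low weight only shrinks its contribution to the loss. Recomputing the weights as training proceeds, rather than fixing them up front, would be one way to read the "dynamic" variant, though the summary does not spell out that procedure.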