Abstract: Synthetic data augmentation via large language models (LLMs) gives researchers additional training data and can improve performance on downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from the real-world distribution, and this misalignment can degrade results when the trained model is applied in practice. We therefore propose efficient weighted-loss approaches that align synthetic data with the real-world distribution by emphasizing high-quality and diverse LLM-generated data, using only a small amount of real-world data. We empirically assess our method on multiple text classification tasks; the results show that applying our approaches to a BERT-level model robustly outperforms standard cross-entropy and other data weighting approaches, offering a potential way to use synthetic data from any suitable data generator for model training.

What problem does this paper attempt to address?

The main problem this paper addresses is that training on data generated by large language models (LLMs) yields weaker text classifiers than training on real-world data. Synthetic data generated by LLMs can augment the training set and improve downstream performance, especially when real-world data is scarce, but it may deviate from the real-world distribution, so the trained model performs poorly in practical applications. The paper therefore proposes an efficient weighted-loss method that aligns the distribution of synthetic data with that of real-world data by emphasizing high-quality and diverse LLM-generated data, using only a small amount of real-world data.

### Main problem summary:

1. **Data deviation**: Data generated by LLMs may deviate from real-world data, which hurts model performance in practical applications.
2. **Data utilization efficiency**: Traditional data filtering strategies discard data that may still be valuable, whereas data weighting uses all of the data by assigning a different weight to each data point.
3. **Model performance improvement**: Whether effective data weighting lets a model trained on synthetic data match or even exceed a model trained on a small amount of real-world data.

### Solution:

The paper proposes two new weighted-loss methods:

- **Importance Loss (IMP-Loss)**: Computes an importance weight for each data point so that the weighted synthetic data more closely follows the real-world distribution.
- **Dynamic Importance Loss (DIMP-Loss)**: Dynamically adjusts the weight of each data point to further improve the model's fit to real-world data.

### Experimental verification:

The paper verifies the effectiveness of both methods through experiments on multiple text classification tasks. The results show that models trained with IMP-Loss and DIMP-Loss outperform standard cross-entropy loss and other data weighting methods on multiple benchmarks, especially when a large amount of synthetic data is combined with a small amount of real-world data.

### Conclusion:

By proposing IMP-Loss and DIMP-Loss, the paper provides an effective way to improve model performance in practical applications when training on LLM-generated synthetic data, while reducing the dependence on large amounts of real-world data. This offers a new approach to the problem of data scarcity.
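
The weighted-loss idea summarized above can be illustrated with a minimal sketch of a generic importance-weighted cross-entropy. This is not the paper's exact formulation of IMP-Loss or DIMP-Loss. The weight used here, a clipped ratio between an estimated real-data probability and an estimated synthetic-data probability for each example, is an assumption made for illustration, and the names `importance_weights`, `p_real`, `p_synth`, and `clip` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def importance_weights(p_real, p_synth, clip=10.0):
    """Hypothetical per-example weights: p_real / p_synth, clipped for stability.

    p_real[i]  : assumed estimate of how likely example i's label is under real data
    p_synth[i] : assumed estimate of how likely it is under the synthetic data
    """
    eps = 1e-12  # guards against division by zero
    return [min(pr / max(ps, eps), clip) for pr, ps in zip(p_real, p_synth)]


def weighted_cross_entropy(logits_batch, labels, weights):
    """Batch mean of w_i * (-log p_model(y_i | x_i))."""
    losses = []
    for logits, y, w in zip(logits_batch, labels, weights):
        p = softmax(logits)[y]
        losses.append(-w * math.log(p))
    return sum(losses) / len(losses)


# Toy batch: two synthetic examples, two classes.
logits_batch = [[2.0, 0.0], [0.0, 0.0]]
labels = [0, 1]

# Example 0 looks more "real" than "synthetic", so it is up-weighted;
# example 1 looks less realistic, so it is down-weighted rather than discarded.
weights = importance_weights(p_real=[0.9, 0.2], p_synth=[0.6, 0.8])

plain_loss = weighted_cross_entropy(logits_batch, labels, [1.0, 1.0])
weighted_loss = weighted_cross_entropy(logits_batch, labels, weights)
print(weights, plain_loss, weighted_loss)
```

With all weights set to 1.0 the function reduces to standard cross-entropy, which is the baseline the summary compares against. A dynamic variant in the spirit of DIMP-Loss would recompute the weights during training instead of fixing them once, but how the paper does this is not specified in the summary above.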

Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations

Exploring LLMs as a Source of Targeted Synthetic Textual Data to Minimize High Confidence Misclassifications

Reducing and Exploiting Data Augmentation Noise through Meta Reweighting Contrastive Learning for Text Classification

The Potential and Limitations of Large Language Models for Text Classification through Synthetic Data Generation

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Data Generation Using Large Language Models for Text Classification: An Empirical Case Study

Data Quality Enhancement on the Basis of Diversity with Large Language Models for Text Classification: Uncovered, Difficult, and Noisy

On the Diversity of Synthetic Data and its Impact on Training Large Language Models

Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models

LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods

LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?

Adversarial Word Dilution as Text Data Augmentation in Low-Resource Regime

Towards Adversarially Robust Text Classifiers by Learning to Reweight Clean Examples

Importance Weighting Can Help Large Language Models Self-Improve

A Survey on Data Synthesis and Augmentation for Large Language Models

Improving Text Classification with Large Language Model-Based Data Augmentation

Large Model-Based Data Augmentation for Imbalanced Text Classification

Not Enough Data? Deep Learning to the Rescue!

Under the Surface: Tracking the Artifactuality of LLM-Generated Data