Abstract: Large language models (LLMs), such as GPT-3 and BERT, have reshaped natural language processing (NLP), with strong capabilities in text generation, translation, summarization, and classification. One promising application is text classification, where an LLM assigns text to predefined categories or labels. This paper reviews the potential and limitations of using LLMs for text classification through synthetic data generation. We examine the methods used to generate and exploit synthetic data with LLMs, including data augmentation, adversarial training, and transfer learning, which aim to address data scarcity and domain adaptation. We assess how effectively these methods improve classification performance, and how synthetic data can improve model generalization and robustness across domains and languages. We also discuss the challenges and ethical considerations of synthetic data generation, including data privacy, bias amplification, and model fairness. We further examine how model size, pretraining data, and fine-tuning strategies affect LLM performance on text classification. Recent studies have shown that larger models pretrained on more diverse data tend to achieve higher accuracy and better generalization on downstream tasks. Fine-tuning strategies such as curriculum learning and self-training can further improve performance by adapting the model to task-specific data distributions.
Through a critical analysis of existing literature and empirical studies, we summarize the current state of the art, identify key research gaps, and propose future directions for using LLMs in text classification through synthetic data generation. These directions include novel approaches for generating diverse and representative synthetic data, evaluation metrics for assessing synthetic data quality, and the study of the long-term societal impacts of deploying LLMs in real-world applications.
