Abstract: Computational social science (CSS) practitioners often rely on human-labeled data to fine-tune supervised text classifiers. We assess the potential for researchers to augment or replace human-generated training data with surrogate training labels from generative large language models (LLMs). We introduce a recommended workflow and test this LLM application by replicating 14 classification tasks and measuring performance. We employ a novel corpus of English-language text classification data sets from recent CSS articles in high-impact journals. Because these data sets are stored in password-protected archives, our analyses are less prone to issues of contamination. For each task, we compare supervised classifiers fine-tuned using GPT-4 labels against classifiers fine-tuned with human annotations and against labels from GPT-4 and Mistral-7B with few-shot in-context learning. Our findings indicate that supervised classification models fine-tuned on LLM-generated labels perform comparably to models fine-tuned with labels from human annotators. Fine-tuning models using LLM-generated labels can be a fast, efficient and cost-effective method of building supervised text classifiers.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper primarily explores how to utilize generative large language models (LLMs) to replace or enhance human-annotated data, thereby improving the efficiency and cost-effectiveness of supervised text classification tasks.

#### Main Research Questions:

1. **Effectiveness of Generative LLM-Annotated Data**: The researchers evaluate the effectiveness of using labels generated by generative LLMs (such as GPT-4) to fine-tune supervised classifiers and compare this with using human-annotated data.
2. **Performance Comparison**: By conducting experiments on 14 different classification tasks, the study compares the performance of several supervised classifiers (such as BERT, RoBERTa, etc.) on different quantities of human-annotated samples and GPT-4 generated samples.
3. **Application of Knowledge Distillation Methods**: The researchers also explore how to use smaller and cheaper student models (such as BERT Base) to approximate the performance of large teacher models (such as GPT-4).

#### Research Background:

- Supervised text classification typically relies on human-annotated datasets, but this approach is costly, time-consuming, and prone to errors.
- Using generative LLMs can generate annotated data more quickly and cheaply, and in some cases, produce high-quality annotation results.

#### Main Findings:

- In 14 classification tasks, the performance of supervised models fine-tuned with GPT-4 generated labels was comparable to those fine-tuned with human-annotated data.
- Models fine-tuned with GPT-4 generated labels performed excellently in terms of recall but were slightly inferior in precision.
- Using GPT-4 generated labels can significantly reduce costs while maintaining high classification performance.

#### Methodology:

- The researchers first validated the performance of LLMs on a small number of human-annotated samples and adjusted prompts to optimize performance.
- Using the validated prompts, they generated an additional 1000 samples for each task and then fine-tuned supervised classifiers with this data.
- Finally, they evaluated the model performance on a test set of 1000 human-annotated samples and compared it with other models.

Through this series of experiments, the paper demonstrates the effectiveness and practicality of generative LLMs in supervised text classification tasks, providing new research tools for future computational social science.
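
The methodology summarized above (validate the teacher on a few human labels, generate surrogate labels, train a student, score it against human labels) can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration and is not the authors' code: `teacher_label` is a hypothetical keyword rule standing in for a GPT-4 call with a validated prompt, `NaiveBayesStudent` is a small bag-of-words model standing in for a fine-tuned classifier such as BERT, and the example texts are invented.

```python
import math
from collections import Counter


def teacher_label(text):
    """Hypothetical stand-in for an LLM call (e.g. GPT-4 with a validated prompt)."""
    return int("tax" in text.lower())


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


class NaiveBayesStudent:
    """Toy stand-in for the supervised student model (assumption, not BERT)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.priors = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def _score(self, text, c):
        # Log prior plus Laplace-smoothed log likelihood of known tokens.
        total = sum(self.counts[c].values()) + len(self.vocab)
        score = math.log(self.priors[c] / sum(self.priors.values()))
        for word in text.lower().split():
            if word in self.vocab:
                score += math.log((self.counts[c][word] + 1) / total)
        return score

    def predict(self, texts):
        return [max(self.classes, key=lambda c: self._score(t, c)) for t in texts]


# Step 1: check the teacher against a small human-labeled validation sample.
validation = [("Cut the income tax now", 1), ("The game ended in a draw", 0)]
val_scores = precision_recall_f1(
    [y for _, y in validation], [teacher_label(t) for t, _ in validation]
)

# Step 2: label an unlabeled pool with the validated teacher.
pool = [
    "We must lower the tax burden",
    "New tax credits for families",
    "The team won the final match",
    "Rain is expected this weekend",
]
surrogate_labels = [teacher_label(t) for t in pool]

# Step 3: train the student on surrogate labels only.
student = NaiveBayesStudent().fit(pool, surrogate_labels)

# Step 4: evaluate the student on held-out human-labeled text.
test = [("A tax reform bill passed", 1), ("The match was postponed by rain", 0)]
predictions = student.predict([t for t, _ in test])
test_scores = precision_recall_f1([y for _, y in test], predictions)
print("validation P/R/F1:", val_scores)
print("test P/R/F1:", test_scores)
```

Reporting precision and recall separately, rather than accuracy alone, makes it possible to see the kind of recall-versus-precision trade-off the summary describes. In a real replication the teacher would be an API call and the student a fine-tuned transformer, with sample sizes in the hundreds or thousands rather than a handful.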

Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels

Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI

The Parrot Dilemma: Human-Labeled vs. LLM-augmented Data in Classification Tasks

Exploring LLMs as a Source of Targeted Synthetic Textual Data to Minimize High Confidence Misclassifications

Annotation Guidelines-Based Knowledge Augmentation: Towards Enhancing Large Language Models for Educational Text Classification

Performance-Guided LLM Knowledge Distillation for Efficient Text Classification at Scale

Using Large Language Model Annotations for Valid Downstream Statistical Inference in Social Science: Design-Based Semi-Supervised Learning

Learning to Predict Usage Options of Product Reviews with LLM-Generated Labels

Enhancing Text Classification through LLM-Driven Active Learning and Human Annotation

A Survey on Knowledge Distillation of Large Language Models

Wisdom of Instruction-Tuned Language Model Crowds. Exploring Model Label Variation

Human-LLM Collaborative Annotation Through Effective Verification of LLM Labels

Using Advanced LLMs to Enhance Smaller LLMs: An Interpretable Knowledge Distillation Approach

LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?

Knowledge Supervised Text Classification with No Labeled Documents

Human Still Wins over LLM: An Empirical Study of Active Learning on Domain-Specific Annotation Tasks

Generate, Annotate, and Learn: NLP with Synthetic Text

Open-Source LLMs for Text Annotation: A Practical Guide for Model Setting and Fine-Tuning

Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

Can LLMs Augment Low-Resource Reading Comprehension Datasets? Opportunities and Challenges

Can Unconfident LLM Annotations Be Used for Confident Conclusions?