Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels

Nicholas Pangakis, Samuel Wolken
2024-06-25
Abstract: Computational social science (CSS) practitioners often rely on human-labeled data to fine-tune supervised text classifiers. We assess the potential for researchers to augment or replace human-generated training data with surrogate training labels from generative large language models (LLMs). We introduce a recommended workflow and test this LLM application by replicating 14 classification tasks and measuring performance. We employ a novel corpus of English-language text classification data sets from recent CSS articles in high-impact journals. Because these data sets are stored in password-protected archives, our analyses are less prone to issues of contamination. For each task, we compare supervised classifiers fine-tuned using GPT-4 labels against classifiers fine-tuned with human annotations and against labels from GPT-4 and Mistral-7B with few-shot in-context learning. Our findings indicate that supervised classification models fine-tuned on LLM-generated labels perform comparably to models fine-tuned with labels from human annotators. Fine-tuning models using LLM-generated labels can be a fast, efficient and cost-effective method of building supervised text classifiers.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper primarily explores how to utilize generative large language models (LLMs) to replace or enhance human-annotated data, thereby improving the efficiency and cost-effectiveness of supervised text classification tasks.

#### Main Research Questions:

1. **Effectiveness of Generative LLM-Annotated Data**: The researchers evaluate the effectiveness of using labels generated by generative LLMs (such as GPT-4) to fine-tune supervised classifiers and compare this with using human-annotated data.
2. **Performance Comparison**: By conducting experiments on 14 different classification tasks, the study compares the performance of several supervised classifiers (such as BERT, RoBERTa, etc.) on different quantities of human-annotated samples and GPT-4 generated samples.
3. **Application of Knowledge Distillation Methods**: The researchers also explore how to use smaller and cheaper student models (such as BERT Base) to approximate the performance of large teacher models (such as GPT-4).

#### Research Background:

- Supervised text classification typically relies on human-annotated datasets, but this approach is costly, time-consuming, and prone to errors.
- Using generative LLMs can generate annotated data more quickly and cheaply, and in some cases, produce high-quality annotation results.

#### Main Findings:

- In 14 classification tasks, the performance of supervised models fine-tuned with GPT-4 generated labels was comparable to those fine-tuned with human-annotated data.
- Models fine-tuned with GPT-4 generated labels performed excellently in terms of recall but were slightly inferior in precision.
- Using GPT-4 generated labels can significantly reduce costs while maintaining high classification performance.

#### Methodology:

- The researchers first validated the performance of LLMs on a small number of human-annotated samples and adjusted prompts to optimize performance.
- Using the validated prompts, they generated an additional 1000 samples for each task and then fine-tuned supervised classifiers with this data.
- Finally, they evaluated the model performance on a test set of 1000 human-annotated samples and compared it with other models.

Through this series of experiments, the paper demonstrates the effectiveness and practicality of generative LLMs in supervised text classification tasks, providing new research tools for future computational social science.
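
The workflow summarized under Methodology (label text with a teacher model, train a cheaper student on those surrogate labels, then score the student against human annotations) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' code: `teacher_label` is a hypothetical keyword rule standing in for a GPT-4 call with a validated prompt, `NaiveBayesStudent` is a tiny bag-of-words classifier standing in for a fine-tuned model such as BERT, and the texts and labels are toy data.

```python
import math
from collections import Counter


def teacher_label(text):
    """Hypothetical stand-in for an LLM annotator (e.g. GPT-4 with a validated prompt)."""
    return 1 if "tax" in text.lower() else 0


class NaiveBayesStudent:
    """Toy bag-of-words student; a real pipeline would fine-tune a model such as BERT."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict(self, texts):
        total_docs = sum(self.class_counts.values())
        preds = []
        for text in texts:
            scores = {}
            for c in self.classes:
                n_words = sum(self.word_counts[c].values())
                score = math.log(self.class_counts[c] / total_docs)
                for w in text.lower().split():
                    # Laplace smoothing, with one extra slot for unseen words
                    score += math.log(
                        (self.word_counts[c][w] + 1) / (n_words + len(self.vocab) + 1)
                    )
                scores[c] = score
            preds.append(max(scores, key=scores.get))
        return preds


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy unlabeled pool that the teacher annotates (assumed data).
unlabeled = [
    "cut the income tax now",
    "tax reform passes senate",
    "the game ended in a draw",
    "new tax on imports",
    "team wins the final match",
    "rain expected this weekend",
]
# Toy held-out test set with human labels (assumed data).
test_texts = ["senate debates tax bill", "match postponed due to rain"]
test_human = [1, 0]

surrogate = [teacher_label(t) for t in unlabeled]        # step 1: teacher labels
student = NaiveBayesStudent().fit(unlabeled, surrogate)  # step 2: train student
preds = student.predict(test_texts)                      # step 3: predict on test set
precision, recall, f1 = precision_recall_f1(test_human, preds)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting precision and recall separately, rather than accuracy alone, matches the finding above that models trained on GPT-4 labels tended to differ from human-label models mainly in the balance between those two metrics.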