Abstract: Human annotation of training samples is expensive, laborious, and sometimes challenging, especially for Natural Language Processing (NLP) tasks. To reduce labeling cost and improve sample efficiency, Active Learning (AL) can be used to label as few samples as possible while still reaching reasonable or comparable results. With the significant advances of Large Language Models (LLMs), LLMs are good candidates for annotating samples and reducing costs further. This work investigates the accuracy and cost of using LLMs (GPT-3.5 and GPT-4) to label samples on 3 different datasets. A consistency-based strategy is proposed to select samples that are potentially incorrectly labeled so that human annotations can be used for those samples in AL settings; we call it the mixed annotation strategy. We then test the performance of AL under two settings: (1) using human annotations only; (2) using the proposed mixed annotation strategy. The accuracy of AL models under 3 AL query strategies is reported on 3 text classification datasets, i.e., AG's News, TREC-6, and Rotten Tomatoes. On AG's News and Rotten Tomatoes, the models trained with the mixed annotation strategy achieve similar or better results compared to those trained with human annotations. The method reveals the great potential of LLMs as annotators in terms of accuracy and cost efficiency in active learning settings.
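
The abstract does not spell out how consistency is measured. One plausible reading, sketched below, is to query the LLM several times (for example with different demonstration examples) and send a sample to a human whenever the runs disagree. The names `Labeler`, `mixed_annotate`, and `annotate_batch` are hypothetical and not taken from the paper; the labelers stand in for real LLM or human calls.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: any function mapping a text to a label.
# In practice this would wrap an LLM call or a human annotation step.
Labeler = Callable[[str], str]


def mixed_annotate(
    text: str, llm_runs: Sequence[Labeler], human: Labeler
) -> Tuple[str, str]:
    """Return (label, source) for one sample.

    Assumption: a sample counts as "consistent" only when every LLM run
    returns the same label; otherwise it is routed to the human labeler.
    """
    votes = [run(text) for run in llm_runs]
    if len(set(votes)) == 1:
        return votes[0], "llm"
    return human(text), "human"


def annotate_batch(
    texts: Sequence[str], llm_runs: Sequence[Labeler], human: Labeler
) -> Tuple[List[str], float]:
    """Label a batch and report the share of samples sent to the human."""
    labels: List[str] = []
    n_human = 0
    for text in texts:
        label, source = mixed_annotate(text, llm_runs, human)
        labels.append(label)
        n_human += source == "human"
    inconsistency_rate = n_human / len(texts) if texts else 0.0
    return labels, inconsistency_rate
```

Under this reading, the inconsistency rate directly determines how much human labeling the mixed strategy still needs, so a dataset where the LLM runs often disagree saves less.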

What problem does this paper attempt to address?

This paper aims to reduce the cost of manually labeling training samples and to improve sample efficiency in natural language processing (NLP) tasks. Specifically, the authors explore how large language models (LLMs), in particular GPT-3.5 and GPT-4, can label samples within the active learning (AL) framework so that costs fall while model performance is maintained or improved.

### Main research objectives:

1. **Evaluate the accuracy and cost of using LLMs for annotation**: Test the annotation performance of GPT-3.5 and GPT-4 on three different datasets and evaluate their accuracy and cost-effectiveness.
2. **Propose a consistency strategy**: Identify samples that may be mislabeled with a consistency-based strategy, so that manual annotation can be used for those samples.
3. **Compare the effects of different annotation strategies**: In the active learning setting, compare manual annotation alone with mixed annotation (combining LLM and manual annotation).

### Research background:

- **High annotation cost**: Manually labeling training samples is both expensive and time-consuming, especially in NLP tasks.
- **Active learning (AL)**: Selecting the most representative and informative samples for annotation reduces the number of samples that must be labeled, and with it the overall cost.
- **Large language models (LLMs)**: In recent years, LLMs such as GPT-3.5 and GPT-4 have performed well on a variety of NLP tasks. Their zero-shot and few-shot learning capabilities make them usable for annotation.

### Experimental design:

1. **Experiment 1**: Evaluate the annotation accuracy and cost of GPT-3.5 and GPT-4 on different datasets.
   - **Datasets**: AG’s News, TREC-6, Rotten Tomatoes.
   - **Methods**: Vary the number of demonstration examples and the strategy for selecting them (random, minimum token, maximum similarity).
   - **Metrics**: Annotation accuracy, actual cost, inconsistency rate.
2. **Experiment 2**: In the active learning setting, compare manual annotation alone with mixed annotation.
   - **Datasets**: The same as above.
   - **Model**: DistilRoBERTa.
   - **Query strategies**: Random, least confidence, breaking ties.
   - **Metrics**: Average accuracy and AUC on the test set.

### Main findings:

- **Cost-efficiency**: GPT-3.5 costs far less than manual annotation; GPT-4 costs more but performs better.
- **Annotation accuracy**: On the AG’s News and TREC-6 datasets, GPT-4 annotates more accurately, sometimes approaching 100%; on the Rotten Tomatoes dataset, GPT-4 is slightly less accurate than GPT-3.5.
- **Inconsistency rate**: GPT-3.5 has a high inconsistency rate on the TREC-6 dataset, so more manual annotation is required to correct errors.

### Conclusions:

- **Effectiveness of the mixed annotation strategy**: On the AG’s News and Rotten Tomatoes datasets, the mixed annotation strategy (combining GPT-3.5 and manual annotation) achieves results equivalent to or better than manual annotation alone.
- **Future work**: Study the performance of LLMs on more complex datasets, and find more effective ways to detect and correct samples the model has mislabeled.

This paper provides a useful reference for the data annotation process in future NLP tasks, especially with respect to cost control and performance optimization.
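
The two uncertainty-based query strategies in Experiment 2 are commonly defined as follows: least confidence scores a sample by one minus its highest predicted class probability, and breaking ties scores it by the margin between its two highest probabilities. The sketch below uses these textbook definitions in plain Python; the function names are hypothetical, and it is not claimed to match the paper's exact implementation.

```python
from typing import List, Sequence


def least_confidence_scores(probs: Sequence[Sequence[float]]) -> List[float]:
    """Uncertainty = 1 - highest class probability (larger is more uncertain)."""
    return [1.0 - max(p) for p in probs]


def breaking_ties_margins(probs: Sequence[Sequence[float]]) -> List[float]:
    """Margin between the two most likely classes (smaller is more uncertain)."""
    margins = []
    for p in probs:
        top1, top2 = sorted(p, reverse=True)[:2]
        margins.append(top1 - top2)
    return margins


def select_batch(
    probs: Sequence[Sequence[float]], k: int, strategy: str
) -> List[int]:
    """Return indices of the k unlabeled samples to send for annotation."""
    indices = range(len(probs))
    if strategy == "least_confidence":
        scores = least_confidence_scores(probs)
        ranked = sorted(indices, key=lambda i: -scores[i])
    elif strategy == "breaking_ties":
        margins = breaking_ties_margins(probs)
        ranked = sorted(indices, key=lambda i: margins[i])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:k]


# Example: class probabilities from a classifier for three unlabeled samples.
pool_probs = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # low top probability
    [0.50, 0.50, 0.00],  # exact tie between two classes
]
picked_lc = select_batch(pool_probs, 1, "least_confidence")
picked_bt = select_batch(pool_probs, 1, "breaking_ties")
```

Here the two strategies pick different samples: least confidence prefers the second sample (lowest top probability), while breaking ties prefers the third (zero margin between the top two classes). In the mixed setting, the selected samples would then be labeled by the LLM or a human rather than by a human alone.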

Active Learning for NLP with Large Language Models
