Abstract: Prevalent supervised learning methods in natural language processing (NLP) are notoriously data-hungry, demanding large amounts of high-quality annotated data. In practice, acquiring such data is costly. Recently, the superior few-shot performance of large language models (LLMs) has propelled the development of dataset generation, where the training data are synthesized solely from LLMs. However, such an approach usually suffers from low-quality issues and requires orders of magnitude more labeled data to achieve satisfactory performance. To fully exploit the potential of LLMs and make use of massive unlabeled data, we propose LLMaAA, which takes LLMs as annotators and puts them into an active learning loop to determine what to annotate efficiently. To learn robustly with pseudo labels, we optimize both the annotation and training processes: (1) we draw k-NN examples from a small demonstration pool as in-context examples, and (2) we adopt the example reweighting technique to assign training samples learnable weights. Compared with previous approaches, LLMaAA features both efficiency and reliability. We conduct experiments and analysis on two classic NLP tasks, named entity recognition and relation extraction. With LLMaAA, task-specific models trained on LLM-generated labels can outperform the teacher with only hundreds of annotated examples, which is much more cost-effective than other baselines.
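
The k-NN step mentioned in the abstract (picking in-context examples from a small demonstration pool) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function, the prompt wording, the label format, and the `DEMO_POOL` contents are all assumptions made for illustration, and a real system would more likely use a sentence encoder for similarity.

```python
import math
from collections import Counter

# Hypothetical demonstration pool; texts and label format are made up for this sketch.
DEMO_POOL = [
    {"text": "Barack Obama visited Paris", "label": "Barack Obama=PER; Paris=LOC"},
    {"text": "Apple released a new phone", "label": "Apple=ORG"},
    {"text": "Angela Merkel visited Berlin", "label": "Angela Merkel=PER; Berlin=LOC"},
]


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_demonstrations(query, pool, k=2):
    """Return the k pool entries most similar to the query, most similar first."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex["text"])), reverse=True)
    return ranked[:k]


def build_prompt(query, demos):
    """Assemble a few-shot annotation prompt from the retrieved demonstrations."""
    parts = ["Label the entities in each sentence."]
    for d in demos:
        parts.append(f"Sentence: {d['text']}\nEntities: {d['label']}")
    parts.append(f"Sentence: {query}\nEntities:")
    return "\n\n".join(parts)
```

Calling `build_prompt(q, knn_demonstrations(q, DEMO_POOL))` yields a prompt whose demonstrations are the pool entries closest to the sentence being annotated, which is the general idea behind retrieval-based in-context example selection.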

What problem does this paper attempt to address?

This paper addresses how to use large language models (LLMs) efficiently and reliably to produce labeled data for natural language processing (NLP) tasks, reducing the need for large amounts of high-quality manually labeled data. Specifically, it proposes LLMaAA (Making Large Language Models as Active Annotators), a framework that uses LLMs as annotators inside an active learning loop to improve the efficiency and quality of data annotation. The framework targets the following key problems:

1. **High cost of data annotation**: Traditional supervised learning methods in NLP require large amounts of high-quality labeled data, which is usually very expensive to obtain. LLMaAA reduces the dependence on manual annotation by having LLMs label data automatically, lowering annotation cost.
2. **Quality problems of generated data**: Previous methods that use LLMs to generate training data often suffer from low quality and need large amounts of labeled data to reach satisfactory results. LLMaAA improves data quality by optimizing both the annotation and training processes, so that only a small amount of labeled data is needed to train well-performing task-specific models (TAMs).
3. **Data efficiency and model performance**: LLMaAA improves annotation efficiency while preserving model performance. Experimental results show that TAMs trained on data labeled through LLMaAA can outperform their teacher LLMs with only a few hundred labeled samples, and are significantly better than other data generation methods.
4. **Data privacy and security**: In the "language model as a service" (LMaaS) setting, users must send data that may contain sensitive or private information to third-party LLM providers, which increases the risk of data leakage. LLMaAA alleviates concerns about data privacy and security by reducing the dependence on large-scale synthetic data.

In summary, the LLMaAA framework aims to address the high cost of data annotation, the low quality of generated data, and data privacy and security concerns in NLP tasks through efficient active learning and optimized data generation strategies, providing a more practical and cost-effective way to deploy large language models in real applications.
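
The active learning loop described above, where a task model decides which unlabeled examples the LLM should annotate next, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's method: entropy-based uncertainty sampling is one common acquisition strategy chosen here for concreteness, `stub_llm_annotate` is a hypothetical stand-in for a real LLM call, and `fit_threshold_model` is a toy one-dimensional classifier standing in for the task-specific model.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability vector; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def fit_threshold_model(labeled):
    """Toy task model (assumption): a 1-D threshold classifier; returns predict_proba."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    if negatives and positives:
        threshold = (max(negatives) + min(positives)) / 2
    else:
        # Only one class seen so far: fall back to the mean of labeled inputs.
        threshold = sum(x for x, _ in labeled) / len(labeled)

    def predict_proba(x):
        p = 1 / (1 + math.exp(-10 * (x - threshold)))
        return [1 - p, p]

    return predict_proba


def stub_llm_annotate(x):
    """Hypothetical LLM annotator; a real system would prompt an LLM here."""
    return int(x > 0.5)


def active_annotation_loop(unlabeled, llm_annotate, fit, rounds=2, batch_size=2):
    """Repeatedly pick the most uncertain examples, have the LLM label them, retrain."""
    remaining = list(unlabeled)
    labeled = []
    # Before any training, treat every example as equally uncertain.
    predict_proba = lambda x: [0.5, 0.5]
    for _ in range(rounds):
        if not remaining:
            break
        ranked = sorted(remaining, key=lambda x: entropy(predict_proba(x)), reverse=True)
        for x in ranked[:batch_size]:
            labeled.append((x, llm_annotate(x)))
            remaining.remove(x)
        predict_proba = fit(labeled)  # retrain the task model on all labels so far
    return labeled, predict_proba
```

After the first round the loop concentrates the annotation budget on examples near the current decision boundary, which is why a few hundred well-chosen labels can go further than a much larger randomly chosen set.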

LLMaAA: Making Large Language Models as Active Annotators
