Abstract: Although large language models (LLMs) have advanced the state-of-the-art in NLP significantly, deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security. As such, trainable models are still the preferred option in some cases. However, these models still require human-labeled data for optimal performance, which is expensive and time-consuming to obtain. In order to address this issue, several techniques to reduce human effort involve labeling or generating data using LLMs. Although these methods are effective for certain applications, in practice they encounter difficulties in real-world scenarios. Labeling data requires careful data selection, while generating data necessitates task-specific prompt engineering. In this paper, we propose a unified data creation pipeline that requires only a single formatting example, and which is applicable to a broad range of tasks, including traditionally problematic ones with semantically devoid label spaces. In our experiments we demonstrate that instruction-following LLMs are highly cost-effective data creators, and that models trained with these data exhibit performance better than those trained with human-labeled data (by up to 17.5%) on out-of-distribution evaluation, while maintaining comparable performance on in-distribution tasks. These results have important implications for the robustness of NLP systems deployed in the real world.

What problem does this paper attempt to address?

This paper addresses how to use large language models (LLMs) to generate high-quality training data for natural language processing (NLP) tasks, reducing the need for manually annotated data. Specifically, it focuses on using LLMs as data creators in resource-limited or specialized fields. The created data are used to train downstream models, which can then outperform models trained on manually annotated data in out-of-distribution (OOD) evaluations.

The paper points out that although LLMs have made remarkable progress in NLP, deploying them in practical applications remains challenging because of cost, responsiveness, control, and privacy and security concerns. Trainable models are therefore still preferred in some cases. However, these models require manually annotated data to achieve optimal performance, and such data are expensive and time-consuming to obtain. Some techniques reduce this human effort by using LLMs to annotate or generate data. Although these methods are effective for some applications, they run into difficulties in practice: data annotation requires careful selection of the data to label, and data generation requires task-specific prompt engineering.

To address these problems, the paper proposes a unified data creation pipeline that requires only a single formatting example and applies to a wide range of tasks, including traditionally difficult ones whose label spaces carry no semantic meaning. Experimental results show that instruction-following LLMs are highly cost-effective data creators, and that models trained on the created data outperform models trained on manually annotated data by up to 17.5% in OOD evaluations while maintaining comparable performance on in-distribution tasks. These results have important implications for the robustness of NLP systems deployed in the real world.
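The idea of driving data creation from a single formatting example can be illustrated with a minimal sketch. This is not the paper's actual prompt or code: the prompt wording, the JSON output format, and the names `build_prompt`, `create_examples`, `SEED_EXAMPLE`, and `fake_llm` are all assumptions made for illustration, and `fake_llm` is a stand-in for a real instruction-following model.

```python
import json
from typing import Callable, Dict, List

# Hypothetical formatting example for a task whose labels ("A", "B", "C")
# carry no meaning on their own.
SEED_EXAMPLE: Dict[str, str] = {
    "question": "Which gas do plants absorb from the air?",
    "options": "A) Oxygen B) Carbon dioxide C) Nitrogen",
    "answer": "B",
}


def build_prompt(instruction: str, example: Dict[str, str], n: int) -> str:
    """Assemble a data-creation prompt around one formatting example."""
    return (
        f"{instruction}\n"
        f"Return {n} new items as a JSON list. "
        "Each item must use exactly the same keys as this example:\n"
        f"{json.dumps(example)}\n"
    )


def create_examples(
    llm: Callable[[str], str],
    instruction: str,
    example: Dict[str, str],
    n: int,
) -> List[Dict[str, str]]:
    """Query the model and keep only items that match the example's format."""
    raw = llm(build_prompt(instruction, example, n))
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparseable output is discarded rather than repaired
    if not isinstance(items, list):
        return []
    keys = set(example)
    return [it for it in items if isinstance(it, dict) and set(it) == keys]


def fake_llm(prompt: str) -> str:
    """Stand-in model (assumption): returns one valid and one malformed item."""
    return json.dumps([
        {
            "question": "What force pulls objects toward Earth?",
            "options": "A) Gravity B) Friction C) Magnetism",
            "answer": "A",
        },
        {"question": "An item that is missing its other fields"},
    ])
```

Calling `create_examples(fake_llm, "Write multiple-choice science questions.", SEED_EXAMPLE, 2)` keeps only the well-formed item. Filtering on the example's keys is one simple way to enforce the format; a real pipeline would likely need additional checks, such as deduplication and label validation.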

Making Large Language Models Better Data Creators

Automatic Text Labeling Method Based on Large Language Models

LLMaAA: Making Large Language Models as Active Annotators

Large Language Models for Data Annotation: A Survey

Under the Surface: Tracking the Artifactuality of LLM-Generated Data

Large Language Models for Data Annotation and Synthesis: A Survey

The Importance of Human-Labeled Data in the Era of LLMs

How to Train Data-Efficient LLMs

Can Large Language Models Design Accurate Label Functions?

Progressively Label Enhancement for Large Language Model Alignment

Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

Empowering Large Language Models for Textual Data Augmentation

Data Generation Using Large Language Models for Text Classification: An Empirical Case Study

Large Language Models as Annotators: Enhancing Generalization of NLP Models at Minimal Cost

Large Language Models as Financial Data Annotators: A Study on Effectiveness and Efficiency

Large Language Models as Data Preprocessors

Large Language Models Humanize Technology

Supervised Knowledge Makes Large Language Models Better In-context Learners

A Survey on Data Synthesis and Augmentation for Large Language Models

TnT-LLM: Text Mining at Scale with Large Language Models

Exploring LLMs as a Source of Targeted Synthetic Textual Data to Minimize High Confidence Misclassifications