FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only

He Zhu, Junyou Su, Tianle Lun, Yicheng Tao, Wenjia Zhang, Zipei Fan, Guanhua Chen

2024-08-02

Abstract: Instruction fine-tuning stands as a crucial advancement in leveraging large language models (LLMs) for enhanced task performance. However, the annotation of instruction datasets has traditionally been expensive and laborious, often relying on manual annotations or costly API calls of proprietary LLMs. To address these challenges, we introduce FANNO, a fully autonomous, open-sourced framework that revolutionizes the annotation process without the need for pre-existing annotated data. Utilizing a Mistral-7b-instruct model, FANNO efficiently produces diverse and high-quality datasets through a structured process involving document pre-screening, instruction generation, and response generation. Experiments on the Open LLM Leaderboard and the AlpacaEval benchmark show that FANNO can generate high-quality data with diversity and complexity for free, comparable to human-annotated or cleaned datasets like Alpaca-GPT4-Cleaned.

Computation and Language

What problem does this paper attempt to address?

### The Problem This Paper Aims to Solve

This paper aims to address the expensive and time-consuming issues present in the process of high-quality instruction data annotation. Specifically:

1. **Limitations of Traditional Annotation Methods**:
   - Current manual annotation methods are costly and can lead to inconsistent data quality, especially in terms of diversity and complexity.
   - The cost of using proprietary large language models (such as GPT-4) for API calls is also very high.
2. **Shortcomings of Existing Automatic Annotation Frameworks**:
   - Existing automatic annotation frameworks either rely on expensive API calls or require manually created seed datasets, making it difficult to achieve ideal results in terms of diversity and complexity.

To tackle these issues, the paper proposes the FANNO framework, a fully autonomous, open-source framework capable of efficiently generating diverse and high-quality instruction datasets without the need for pre-annotated data. Through three stages (document pre-screening, instruction generation, and response generation), FANNO can significantly improve the quality and diversity of instruction data without increasing costs.

Experimental results show that the datasets generated by FANNO are comparable to manually annotated datasets in terms of diversity and complexity, and even outperform them on certain metrics. This indicates that FANNO has significant advantages in instruction data generation.
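
The three stages named above can be pictured as a simple pipeline. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `generate` callable, the prompt wording, the `min_words` threshold, and the length-and-duplicate filter used for pre-screening are all hypothetical placeholders, since the summary does not specify how each stage works internally.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical type: any function mapping a prompt string to generated text,
# for example a wrapper around a locally hosted open-source instruct model.
Generate = Callable[[str], str]


@dataclass
class InstructionPair:
    instruction: str
    response: str


def prescreen(documents: Iterable[str], min_words: int = 20) -> List[str]:
    """Stage 1 (placeholder heuristic): drop short documents and exact duplicates."""
    seen = set()
    kept = []
    for doc in documents:
        text = " ".join(doc.split())  # normalize whitespace
        if len(text.split()) < min_words or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


def generate_instruction(doc: str, generate: Generate) -> str:
    """Stage 2: ask the model for an instruction grounded in the document."""
    prompt = (
        "Read the document below and write one self-contained task "
        "instruction inspired by it.\n\n"
        f"Document:\n{doc}\n\nInstruction:"
    )
    return generate(prompt).strip()


def generate_response(instruction: str, generate: Generate) -> str:
    """Stage 3: ask the model to answer the generated instruction."""
    return generate(f"Instruction:\n{instruction}\n\nResponse:").strip()


def build_dataset(
    documents: Iterable[str], generate: Generate, min_words: int = 20
) -> List[InstructionPair]:
    """Run the three stages in order and keep only non-empty pairs."""
    pairs = []
    for doc in prescreen(documents, min_words):
        instruction = generate_instruction(doc, generate)
        if not instruction:
            continue
        response = generate_response(instruction, generate)
        if response:
            pairs.append(InstructionPair(instruction, response))
    return pairs
```

Passing the model in as a plain callable keeps the sketch independent of any particular inference library; any open-source instruct model that can be wrapped as `str -> str` would fit.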

Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud

Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models

Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

CoachLM: Automatic Instruction Revisions Improve the Data Quality in LLM Instruction Tuning

Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

Instruction Mining: Instruction Data Selection for Tuning Large Language Models

Panda LLM: Training Data and Evaluation for Open-Sourced Chinese Instruction-Following Large Language Models

AlpaGasus: Training A Better Alpaca with Fewer Data

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Interactive Multi-fidelity Learning for Cost-effective Adaptation of Language Model with Sparse Human Supervision

REInstruct: Building Instruction Data from Unlabeled Corpus

An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models

EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models

Rethinking the Instruction Quality: LIFT is What You Need

Improving Translation Faithfulness of Large Language Models via Augmenting Instructions

Enhancing Task Performance in Continual Instruction Fine-tuning Through Format Uniformity

Improving Multilingual Instruction Finetuning via Linguistically Natural and Diverse Datasets

Instruction Finetuning for Leaderboard Generation from Empirical AI Research

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing