FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only

He Zhu, Junyou Su, Tianle Lun, Yicheng Tao, Wenjia Zhang, Zipei Fan, Guanhua Chen
2024-08-02
Abstract: Instruction fine-tuning stands as a crucial advancement in leveraging large language models (LLMs) for enhanced task performance. However, the annotation of instruction datasets has traditionally been expensive and laborious, often relying on manual annotations or costly API calls of proprietary LLMs. To address these challenges, we introduce FANNO, a fully autonomous, open-sourced framework that revolutionizes the annotation process without the need for pre-existing annotated data. Utilizing a Mistral-7b-instruct model, FANNO efficiently produces diverse and high-quality datasets through a structured process involving document pre-screening, instruction generation, and response generation. Experiments on the Open LLM Leaderboard and AlpacaEval benchmarks show that FANNO can generate high-quality data with diversity and complexity for free, comparable to human-annotated or cleaned datasets like Alpaca-GPT4-Cleaned.
Computation and Language
What problem does this paper attempt to address?
### The Problem This Paper Aims to Solve

This paper addresses the high cost and labor involved in annotating high-quality instruction data. Specifically:

1. **Limitations of Traditional Annotation Methods**:
   - Manual annotation is costly and can lead to inconsistent data quality, especially in terms of diversity and complexity.
   - Calling proprietary large language models (such as GPT-4) through their APIs is also very expensive.
2. **Shortcomings of Existing Automatic Annotation Frameworks**:
   - Existing automatic annotation frameworks either rely on expensive API calls or require manually created seed datasets, which makes it difficult to achieve the desired diversity and complexity.

To tackle these issues, the paper proposes FANNO, a fully autonomous, open-source framework that generates diverse, high-quality instruction datasets without the need for pre-annotated data. FANNO works in three stages: document pre-screening, instruction generation, and response generation. Together, these stages are meant to improve the quality and diversity of instruction data without increasing costs.

Experimental results show that the datasets generated by FANNO are comparable to manually annotated datasets in terms of diversity and complexity, and even outperform them on certain metrics. This suggests that FANNO offers a real advantage for instruction data generation.
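
The three stages named above can be pictured as a simple pipeline. The following is only a minimal sketch under stated assumptions, not the paper's implementation: the filtering heuristic, the prompt wording, and every function name (`prescreen`, `generate_instruction`, `generate_response`, `build_dataset`, `echo_model`) are hypothetical, and `generate` is a placeholder for a call to an open-source instruct model such as Mistral-7b-instruct.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an open-source instruct model: prompt in, text out.
Generate = Callable[[str], str]


def prescreen(documents: List[str], min_words: int = 5) -> List[str]:
    """Stage 1 (assumed heuristic): drop very short documents and exact duplicates."""
    seen = set()
    kept = []
    for doc in documents:
        text = " ".join(doc.split())  # normalize whitespace before comparing
        if len(text.split()) < min_words or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


def generate_instruction(doc: str, generate: Generate) -> str:
    """Stage 2: ask the model for an instruction grounded in the document."""
    # The prompt wording is an assumption made for illustration.
    prompt = f"Write one instruction that can be answered using this text:\n{doc}"
    return generate(prompt).strip()


def generate_response(instruction: str, generate: Generate) -> str:
    """Stage 3: ask the model to answer the generated instruction."""
    return generate(instruction).strip()


def build_dataset(documents: List[str], generate: Generate) -> List[Dict[str, str]]:
    """Run the three stages in order and collect instruction-response pairs."""
    dataset = []
    for doc in prescreen(documents):
        instruction = generate_instruction(doc, generate)
        if not instruction:
            continue  # skip documents for which the model returned nothing
        response = generate_response(instruction, generate)
        dataset.append({"instruction": instruction, "response": response})
    return dataset


def echo_model(prompt: str) -> str:
    """Toy model for demonstration only: reports the prompt's word count."""
    return f"reply to {len(prompt.split())} words"


if __name__ == "__main__":
    docs = [
        "too short",
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps   over the lazy dog",  # duplicate after normalization
    ]
    for pair in build_dataset(docs, echo_model):
        print(pair)
```

Passing the model as a plain callable keeps the sketch independent of any particular inference library; in practice `echo_model` would be replaced by a function that wraps whichever local model is used.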