AlpaGasus: Training A Better Alpaca with Fewer Data

Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, Hongxia Jin

DOI: https://doi.org/10.48550/arXiv.2307.08701

2024-02-14

Abstract: Large language models (LLMs) strengthen instruction-following capability through instruction-finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca's 52k data) surprisingly contain many low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and filters out low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce AlpaGasus, which is finetuned on only 9k high-quality data filtered from the 52k Alpaca data. AlpaGasus significantly outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets and the controlled human evaluation. Its 13B variant matches $>90\%$ performance of its teacher LLM (i.e., Text-Davinci-003 generating the 52k data) on test tasks. It also provides 5.7x faster training, reducing the training time for a 7B variant from 80 minutes (for Alpaca) to 14 minutes. Moreover, the experiments prove the efficacy of our method across diverse datasets, base models, and LLM filters. Overall, AlpaGasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models. Our project page is available at: https://lichang-chen.github.io/AlpaGasus/
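As a quick sanity check on the figures quoted in the abstract, the short Python snippet below recomputes the training speedup and the share of data retained. It treats the rounded values "9k" and "52k" as exactly 9,000 and 52,000, which is an assumption made only for illustration; the variable names are likewise hypothetical.

```python
# Figures taken from the abstract; 9k and 52k are treated as exact
# counts here, which is an illustrative assumption.
alpaca_minutes = 80       # reported training time for the 7B Alpaca
alpagasus_minutes = 14    # reported training time for the 7B AlpaGasus
kept_examples = 9_000     # "9k" filtered examples (rounded)
total_examples = 52_000   # "52k" original Alpaca examples (rounded)

speedup = alpaca_minutes / alpagasus_minutes
kept_fraction = kept_examples / total_examples

print(f"training speedup: about {speedup:.1f}x")
print(f"share of data kept: about {kept_fraction:.1%}")
```

Under these assumptions the ratio of training times comes out consistent with the 5.7x speedup stated in the abstract, and the filtered set is a bit over one sixth of the original data.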

Computation and Language

What problem does this paper attempt to address?

The paper addresses the following problem: existing instruction fine-tuning (IFT) datasets, such as Alpaca's 52k examples, contain many low-quality instances whose responses are incorrect or irrelevant, which makes them misleading and harmful to IFT. To address this, the authors propose a simple and effective data selection strategy that uses a strong LLM (such as ChatGPT) to automatically identify and filter out low-quality data. The main contributions of the paper are:

1. **Proposing the AlpaGasus model**: Fine-tuned on only 9k high-quality examples selected from the 52k Alpaca data, AlpaGasus significantly outperforms the original Alpaca model.
2. **Improving training efficiency**: AlpaGasus not only performs better but also trains faster. For example, the training time of the 7B variant drops from 80 minutes to 14 minutes.
3. **Verifying the generality of the method**: The experiments show that the method is effective across different datasets, base models, and LLM filters.
4. **Emphasizing the importance of data quality**: The results indicate that in IFT, data quality matters more than quantity, and that a small high-quality dataset can yield better model performance.

Together, these results position AlpaGasus as a data-centric IFT paradigm that fine-tunes large language models more efficiently and improves their instruction-following ability.
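The selection strategy summarized above (score each instruction/response pair, then keep only the high-scoring ones) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `filter_by_score`, `toy_score`, the score scale, and the threshold value are all hypothetical, and `toy_score` is a trivial stand-in for the strong LLM grader (e.g., ChatGPT) that the paper actually relies on.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]  # assumed keys: "instruction" and "output"


def filter_by_score(
    data: List[Example],
    score_fn: Callable[[Example], float],
    threshold: float,
) -> List[Example]:
    """Keep only the examples whose score is at least `threshold`."""
    return [ex for ex in data if score_fn(ex) >= threshold]


def toy_score(ex: Example) -> float:
    """Hypothetical stand-in for an LLM grader.

    A real grader would prompt a strong LLM to rate the response;
    here an empty response simply gets the lowest score.
    """
    return 5.0 if ex["output"].strip() else 0.0


data = [
    {"instruction": "Name a primary color.", "output": "Red."},
    {"instruction": "Translate 'cat' to French.", "output": ""},
]

# The threshold below is an arbitrary illustrative value.
selected = filter_by_score(data, toy_score, threshold=4.0)
print(len(selected), "of", len(data), "examples kept")
```

Passing the grader in as `score_fn` keeps the filtering logic independent of which LLM does the rating, which mirrors the summary's point that the method works with different LLM filters.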


AlpaCare: Instruction-tuned Large Language Models for Medical Application

Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning

Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca

SelectIT: Selective Instruction Tuning for Large Language Models Via Uncertainty-Aware Self-Reflection

InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4

Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Labeling supervised fine-tuning data with the scaling law

An Empirical Study of Instruction-tuning Large Language Models in Chinese

Jellyfish: A Large Language Model for Data Preprocessing

Pedagogical Alignment of Large Language Models

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data

Annotation Guidelines-Based Knowledge Augmentation: Towards Enhancing Large Language Models for Educational Text Classification

A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks

PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

Efficient Finetuning Large Language Models For Vietnamese Chatbot