AlpaGasus: Training A Better Alpaca with Fewer Data

Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, Hongxia Jin
DOI: https://doi.org/10.48550/arXiv.2307.08701
2024-02-14
Abstract: Large language models (LLMs) strengthen instruction-following capability through instruction-finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca's 52k data) surprisingly contain many low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and filters out low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce AlpaGasus, which is finetuned on only 9k high-quality data filtered from the 52k Alpaca data. AlpaGasus significantly outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets and the controlled human evaluation. Its 13B variant matches $>90\%$ performance of its teacher LLM (i.e., Text-Davinci-003 generating the 52k data) on test tasks. It also provides 5.7x faster training, reducing the training time for a 7B variant from 80 minutes (for Alpaca) to 14 minutes. Moreover, the experiments prove the efficacy of our method across diverse datasets, base models, and LLM filters. Overall, AlpaGasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models. Our project page is available at: https://lichang-chen.github.io/AlpaGasus/
Computation and Language
What problem does this paper attempt to address?
This paper addresses a data-quality problem in instruction fine-tuning (IFT): widely used IFT datasets, such as Alpaca's 52k examples, contain many low-quality instances whose responses are incorrect or irrelevant, which makes them misleading and harmful to IFT. To address this, the authors propose a simple and effective data selection strategy that uses a strong LLM (such as ChatGPT) to automatically identify and filter out low-quality data. The main contributions of the paper are:

1. **Proposing the AlpaGasus model**: Fine-tuned on only 9k high-quality examples selected from the 52k Alpaca data, AlpaGasus significantly outperforms the original Alpaca model, as judged by GPT-4 on multiple test sets and by a controlled human evaluation.
2. **Improving training efficiency**: Because it trains on far less data, AlpaGasus trains 5.7x faster. For example, the training time of the 7B variant drops from 80 minutes to 14 minutes.
3. **Verifying the generality of the method**: The experiments show that the method is effective across different datasets, base models, and LLM filters.
4. **Emphasizing the importance of data quality**: The results indicate that, in IFT, data quality matters more than data quantity, and that a small high-quality dataset can yield a better model.

Together, these results present AlpaGasus as a data-centric IFT paradigm that fine-tunes large language models more efficiently and improves their instruction-following ability.
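The selection strategy summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `filter_by_score` and `toy_score`, the example records, and the threshold value are all hypothetical, and `toy_score` is a crude offline stand-in for the LLM grader (e.g., ChatGPT) that the paper uses to rate each instruction/response pair. The last lines only check the 5.7x speedup figure against the 80-minute and 14-minute training times quoted in the abstract.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def filter_by_score(
    examples: List[Example],
    score_fn: Callable[[Example], float],
    threshold: float,
) -> List[Example]:
    """Keep only the examples whose quality score reaches the threshold."""
    return [ex for ex in examples if score_fn(ex) >= threshold]


def toy_score(example: Example) -> float:
    """Hypothetical stand-in for an LLM grader.

    A real pipeline would ask a strong LLM to rate the response; here we
    assume, purely for illustration, that empty or very short responses
    are low quality.
    """
    response = example["response"].strip()
    if not response:
        return 0.0
    return 5.0 if len(response.split()) >= 3 else 2.0


# Made-up records in an instruction/response layout (illustrative only).
data = [
    {"instruction": "Name the capital of France.",
     "response": "The capital of France is Paris."},
    {"instruction": "Translate 'hello' to Spanish.",
     "response": ""},
    {"instruction": "Add 2 and 3.",
     "response": "Yes."},
]

# The threshold is an arbitrary value chosen for this sketch.
kept = filter_by_score(data, toy_score, threshold=4.5)
print(f"kept {len(kept)} of {len(data)} examples")

# Sanity check of the reported speedup: 80 minutes vs. 14 minutes.
speedup = 80 / 14
print(f"training speedup: {speedup:.1f}x")
```

Passing the scorer in as a function keeps the filtering step independent of any particular grader, so an actual LLM call could replace `toy_score` without changing `filter_by_score`.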