Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation

Da Yin,Xiao Liu,Fan Yin,Ming Zhong,Hritik Bansal,Jiawei Han,Kai-Wei Chang
DOI: https://doi.org/10.48550/arXiv.2305.14327
2023-10-26
Abstract:Instruction tuning has emerged to enhance the capabilities of large language models (LLMs) to comprehend instructions and generate appropriate responses. Existing methods either manually annotate or employ LLM (e.g., GPT-series) to generate data for instruction tuning. However, they often overlook associating instructions with existing annotated datasets. In this paper, we propose Dynosaur, a dynamic growth paradigm for the automatic curation of instruction-tuning data. Based on the metadata of existing datasets, we use LLMs to automatically construct instruction-tuning data by identifying relevant data fields and generating appropriate instructions. By leveraging the existing annotated datasets, Dynosaur offers several advantages: 1) it reduces the API cost for generating instructions (e.g., it costs less than $12 USD by calling GPT-3.5-turbo for generating 800K instruction tuning samples; 2) it provides high-quality data for instruction tuning (e.g., it performs better than Alpaca and Flan on Super-NI and Longform with comparable data sizes); and 3) it supports the continuous improvement of models by generating instruction-tuning data when a new annotated dataset becomes available. We further investigate a continual learning scheme for learning with the ever-growing instruction-tuning dataset, and demonstrate that replaying tasks with diverse instruction embeddings not only helps mitigate forgetting issues but generalizes to unseen tasks better. Code and data are available at <a class="link-external link-https" href="https://github.com/WadeYin9712/Dynosaur" rel="external noopener nofollow">this https URL</a>.
Computation and Language,Artificial Intelligence
What problem does this paper attempt to address?