IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection

Jielin Song, Siyu Liu, Bin Zhu, Yanghui Rao
2024-10-17
Abstract: As large language models (LLMs) continue to advance, instruction tuning has become critical for improving their ability to generate accurate and contextually appropriate responses. Although numerous instruction-tuning datasets have been developed to enhance LLM performance, selecting high-quality instruction data from large source datasets typically demands significant human effort. In this work, we introduce **IterSelectTune**, an efficient, cost-effective iterative training policy for selecting high-quality instruction data with no human involvement and limited reliance on GPT-4. By fine-tuning on approximately 20% of the source data, our method consistently outperforms models fine-tuned on the full dataset across multiple benchmarks and public test datasets. These results highlight the effectiveness of our approach in enhancing LLM performance while reducing the computational resources required for instruction tuning.
Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses two main problems that large language models (LLMs) face during instruction tuning:

1. **Selection of high-quality instruction data**: Although many instruction-tuning datasets have been developed to improve LLM performance, selecting high-quality instruction data usually requires a great deal of human effort. Existing methods either rely on predefined metrics to evaluate data quality, which may not apply to all datasets, or make extensive use of advanced models such as GPT-4, which is cost-prohibitive.
2. **Reducing the demand for computational resources**: The traditional approach fine-tunes LLMs on the entire dataset, which is time-consuming and may also lead to overfitting. Maintaining or improving model performance while reducing computational cost is therefore an important research direction.

To solve these problems, the authors propose an iterative training framework named IterSelectTune. The framework achieves its goals in the following ways:

- **Efficient data selection**: Through an iterative training strategy, it automatically selects high-quality instruction data without human intervention and with limited dependence on GPT-4.
- **Reducing computational resources**: When fine-tuned on approximately 20% of the source data, the model performs better on multiple benchmarks and public test datasets than models fine-tuned on the complete dataset.

### Main contributions

1. **Proposing an iterative training strategy framework**: The framework efficiently selects high-quality and diverse instruction data from large datasets while minimizing the use of GPT-4 and human intervention, which keeps it cost-effective and scalable.
2. **Improvement in model performance**: The model fine-tuned on approximately 20% of the instruction data outperforms the model fine-tuned on the complete dataset on multiple benchmarks and test sets.
3. **Experimental verification**: Experiments on models such as Alpaca and WizardLM show that using a smaller amount of data (5% and 10% respectively) can achieve performance comparable to that of the complete-dataset model while requiring less time.

### Method overview

1. **Diversity module**: Ensures coverage of a wide range of instruction types.
2. **Iterative training classifier**: Identifies high-quality data.
3. **Similarity module**: Prioritizes instructions that are semantically close to the "difficult" data marked by GPT-4.

Across its two stages (the iterative training stage and the inference stage), IterSelectTune combines these components to select high-quality instruction data efficiently while preserving diversity.
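
To make the interplay of the three components more concrete, the following is a minimal sketch of how a selection step at inference time might combine them. It is not the authors' implementation: the function names, the linear score mix controlled by `alpha`, the round-robin bucket scheme used as a stand-in for the diversity module, and the random toy data are all assumptions made for illustration. The quality scores are assumed to come from an already trained classifier, and the "difficult" examples are assumed to be available as embeddings.

```python
import numpy as np


def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a (n, d) and b (m, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def select_instructions(embeddings, quality_scores, hard_embeddings,
                        budget_frac=0.2, n_buckets=4, alpha=0.5, seed=0):
    """Pick roughly budget_frac of the instructions (hypothetical scheme).

    embeddings:      (n, d) instruction embeddings
    quality_scores:  (n,) scores from a quality classifier (assumed given)
    hard_embeddings: (m, d) embeddings of examples flagged as "difficult"
    """
    n = len(embeddings)
    budget = min(n, max(1, int(round(budget_frac * n))))

    # Similarity signal: closeness to the nearest "difficult" example.
    sim_to_hard = cosine_sim(embeddings, hard_embeddings).max(axis=1)

    # Assumed linear mix of classifier score and similarity score.
    score = alpha * quality_scores + (1 - alpha) * sim_to_hard

    # Diversity signal: assign each instruction to the nearest of a few
    # randomly chosen reference points (a crude stand-in for clustering).
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(n, size=min(n_buckets, n), replace=False)]
    bucket = cosine_sim(embeddings, centroids).argmax(axis=1)

    # One queue per bucket, best-scoring instruction first.
    queues = [
        sorted(np.flatnonzero(bucket == b), key=lambda i: -score[i])
        for b in range(len(centroids))
    ]

    # Round-robin over buckets so that no single region dominates.
    selected = []
    while len(selected) < budget:
        for q in queues:
            if q and len(selected) < budget:
                selected.append(int(q.pop(0)))
    return selected


# Toy usage with random data (purely illustrative).
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(50, 8))
quality_scores = rng.uniform(size=50)
hard_embeddings = rng.normal(size=(5, 8))

selected = select_instructions(embeddings, quality_scores, hard_embeddings)
print(len(selected), "of", len(embeddings), "instructions selected")
```

The round-robin step is one simple way to trade quality against coverage: a bucket with many high-scoring items cannot exhaust the budget before other buckets contribute. A real pipeline would likely replace the random reference points with a proper clustering step and tune how the two score terms are weighted.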