Abstract: As large language models (LLMs) continue to advance, instruction tuning has become critical for improving their ability to generate accurate and contextually appropriate responses. Although numerous instruction-tuning datasets have been developed to enhance LLM performance, selecting high-quality instruction data from large source datasets typically demands significant human effort. In this work, we introduce **IterSelectTune**, an efficient, cost-effective iterative training policy for selecting high-quality instruction data with no human involvement and limited reliance on GPT-4. By fine-tuning on approximately 20% of the source data, our method consistently outperforms models fine-tuned on the full dataset across multiple benchmarks and public test datasets. These results highlight the effectiveness of our approach in enhancing LLM performance while reducing the computational resources required for instruction tuning.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses two main problems that large language models (LLMs) face during instruction tuning:

1. **Selection of high-quality instruction data**: Although many instruction-tuning datasets have been developed to improve LLM performance, selecting high-quality instruction data usually requires a great deal of human effort. Existing methods either rely on predefined metrics to evaluate data quality, which may not apply to all datasets, or require extensive use of advanced models such as GPT-4, which is cost-prohibitive.
2. **Reducing the demand for computational resources**: The traditional approach fine-tunes LLMs on the entire dataset, which is time-consuming and may also lead to overfitting. Maintaining or improving model performance while reducing computational resources is therefore an important research direction.

To solve these problems, the authors propose an iterative training framework named IterSelectTune. The framework achieves its goals in the following ways:

- **Efficient data selection**: Through an iterative training strategy, it automatically selects high-quality instruction data without human intervention and with limited dependence on GPT-4.
- **Reducing computational resources**: A model fine-tuned on approximately 20% of the source data performs better on multiple benchmarks and public test datasets than models fine-tuned on the complete dataset.

### Main contributions

1. **Proposing an iterative training framework**: The framework efficiently selects high-quality and diverse instruction data from large datasets while minimizing the use of GPT-4 and human intervention, which keeps it cost-effective and scalable.
2. **Improvement in model performance**: The model fine-tuned on approximately 20% of the instruction data outperforms the model fine-tuned on the complete dataset on multiple benchmarks and test sets.
3. **Experimental verification**: Experiments on models such as Alpaca and WizardLM show that a smaller amount of data (5% and 10% respectively) can achieve performance comparable to that of the complete-dataset model while requiring less time.

### Method overview

1. **Diversity module**: Ensures coverage of a wide range of instruction types.
2. **Iterative training classifier**: Identifies high-quality data.
3. **Similarity module**: Prioritizes instruction data that are semantically close to the "difficult" data marked by GPT-4.

Operating in two stages (an iterative training stage and an inference stage), IterSelectTune selects high-quality instruction data efficiently while preserving data quality and diversity.
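To make the interplay of the three modules in the method overview more concrete, here is a minimal Python sketch of how a diversity step, a quality score, and a similarity score could be combined to pick a fixed budget of examples. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say how each module is built, so the toy k-means clustering, the cosine similarity, the weighted sum with `alpha`, and every function and parameter name (`select_subset`, `quality`, `hard_embeddings`, `budget`) are hypothetical. The `quality` list stands in for the output of the iteratively trained classifier, and `hard_embeddings` stands in for embeddings of the examples GPT-4 marked as difficult.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def kmeans_assign(embeddings, k, iters=10, seed=0):
    """Toy k-means (assumed diversity step): returns a cluster id per embedding."""
    k = min(k, len(embeddings))
    rng = random.Random(seed)
    centers = [list(e) for e in rng.sample(embeddings, k)]
    assign = [0] * len(embeddings)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(e, centers[c])))
            for e in embeddings
        ]
        # Move each non-empty center to the mean of its members.
        for c in range(k):
            members = [e for e, a in zip(embeddings, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def select_subset(embeddings, quality, hard_embeddings, budget, k=4, alpha=0.5):
    """Pick up to `budget` example indices.

    quality:         per-example scores in [0, 1] (stand-in for the classifier).
    hard_embeddings: embeddings of "difficult" examples (stand-in for GPT-4 labels).
    alpha:           hypothetical weight between quality and similarity.
    """
    assign = kmeans_assign(embeddings, k)

    def score(i):
        sim = max((cosine(embeddings[i], h) for h in hard_embeddings), default=0.0)
        return alpha * quality[i] + (1 - alpha) * sim

    # Group indices by cluster and rank each cluster by the combined score.
    clusters = {}
    for i, c in enumerate(assign):
        clusters.setdefault(c, []).append(i)
    queues = [sorted(clusters[c], key=score, reverse=True) for c in sorted(clusters)]

    # Round-robin over clusters so every instruction type is represented.
    selected = []
    while len(selected) < budget and any(queues):
        for q in queues:
            if q and len(selected) < budget:
                selected.append(q.pop(0))
    return selected
```

With 20 candidate examples, calling `select_subset(embeddings, quality, hard_embeddings, budget=4)` returns four indices, which corresponds to the roughly 20% selection ratio mentioned above. The round-robin over clusters is one simple way to keep the selection diverse; a real pipeline would likely use learned sentence embeddings and a trained classifier in place of the stand-ins here.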

IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection
