Abstract: Instruction-tuning language models has become a crucial step in aligning them for general use. Typically, this process involves extensive training on large datasets, incurring high training costs. In this paper, we introduce a novel training data selection based on the learning percentage of the samples. We assert that current language models possess the capability to autonomously select high-quality training data, leading to comparable or improved performance compared to training on the entire dataset. Our experiments span different-sized models, revealing that this characteristic holds for models ranging from 1B (small) to 13B (large) in size. Moreover, we demonstrate an interesting finding that the data hardness transfers across model sizes, and a smaller 350M model can effectively curate high-quality training data with hard samples for a larger 13B model, resulting in an equally or superior instruction-tuned model compared to training on the complete dataset. Utilizing open-sourced OPT and Llama-2 models up to 13B in size, two publicly available instruction-tuning training datasets and evaluated by both automatic metrics and humans, our paper introduces a novel approach to training data selection, showcasing a more efficient alternative.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is how to reduce the cost of instruction-tuning large language models (LLMs) through more efficient data selection while maintaining or improving model performance. Specifically, the authors propose a data selection method based on the learning percentage (LP) of each sample, which automatically identifies the difficult samples that are most valuable for training. With this method, even smaller language models can select high-quality training data for larger models, reducing the amount of data required for training, lowering costs, and achieving performance comparable to or better than training on the complete dataset.

The key points mentioned in the paper include:

1. **Importance of data selection**: Conventional instruction-tuning trains extensively on large datasets, which leads to high computational costs. An effective method for selecting high-quality training data is therefore crucial for improving training efficiency and reducing costs.
2. **Learning percentage as a difficulty indicator**: The authors propose a difficulty indicator based on the learning percentage (LP) of a training sample. LP is defined as the proportion of the sample's perplexity reduction that occurs in the early stage of training. The lower the LP value, the more difficult the sample is to learn.
3. **Transfer of data difficulty across model sizes**: The study finds that samples identified as difficult by small models are also difficult for large models, which means that small models can effectively select training data for large models.
4. **Proposal of LPapp**: To make data selection more efficient, the authors also propose an approximate learning percentage indicator (LPapp), which can be computed with only a single training run and performs comparably to or better than LP.

Through these methods, the paper shows how to significantly reduce the computational resources required for training without sacrificing model performance, providing a more efficient and economical approach to instruction-tuning large language models.
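The learning-percentage idea summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that LP is the share of a sample's total perplexity drop that happens during the first epoch, which is one reading of "perplexity reduction in the early stage of training". The function names `learning_percentage` and `select_hard_samples`, the `fraction` parameter, and the toy perplexity values are all hypothetical.

```python
from typing import Dict, List, Sequence


def learning_percentage(perplexities: Sequence[float]) -> float:
    """Share of the total perplexity drop achieved in the first epoch.

    perplexities[0] is the perplexity before training and
    perplexities[k] is the perplexity after epoch k.
    ASSUMPTION: "early stage" means the first epoch only.
    """
    if len(perplexities) < 2:
        raise ValueError("need at least two perplexity values")
    total_drop = perplexities[0] - perplexities[-1]
    if total_drop <= 0:
        # The sample never improved; treat it as maximally hard.
        return 0.0
    return (perplexities[0] - perplexities[1]) / total_drop


def select_hard_samples(
    trajectories: Dict[str, Sequence[float]], fraction: float
) -> List[str]:
    """Return the ids of the lowest-LP (hardest) samples.

    `fraction` is the share of the dataset to keep (hypothetical knob).
    """
    scores = {sid: learning_percentage(p) for sid, p in trajectories.items()}
    keep = max(1, round(len(scores) * fraction))
    # Lower LP means harder, so sort ascending and keep the head.
    return sorted(scores, key=lambda sid: scores[sid])[:keep]


# Toy per-sample perplexities, e.g. recorded while training a small model.
toy = {
    "a": [10.0, 4.0, 2.0],  # learned mostly in epoch 1: easy
    "b": [10.0, 9.0, 2.0],  # learned late: hard
    "c": [10.0, 7.0, 2.0],
}
hard_ids = select_hard_samples(toy, fraction=2 / 3)
```

In the setting the summary describes, the perplexity trajectories would come from a small model, and the selected ids would then be used to build the training subset for a larger model.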

Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models

IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection

Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace

Instruction Mining: Instruction Data Selection for Tuning Large Language Models

LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms

Stronger Models are NOT Stronger Teachers for Instruction Tuning

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

TAIA: Large Language Models are Out-of-Distribution Data Learners

Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning

Smaller Language Models Are Better Instruction Evolvers

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Selecting large language model to fine-tune via rectified scaling law

Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs

MoDS: Model-oriented Data Selection for Instruction Tuning

Small Language Model as Data Prospector for Large Language Model

LESS: Selecting Influential Data for Targeted Instruction Tuning

Instruction Tuning for Large Language Models: A Survey

Multilingual Instruction Tuning With Just a Pinch of Multilinguality

Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune?