Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models

Dheeraj Mekala, Alex Nguyen, Jingbo Shang
2024-02-16
Abstract: Instruction-tuning language models has become a crucial step in aligning them for general use. Typically, this process involves extensive training on large datasets, incurring high training costs. In this paper, we introduce a novel training data selection based on the learning percentage of the samples. We assert that current language models possess the capability to autonomously select high-quality training data, leading to comparable or improved performance compared to training on the entire dataset. Our experiments span different-sized models, revealing that this characteristic holds for models ranging from 1B (small) to 13B (large) in size. Moreover, we demonstrate an interesting finding that the data hardness transfers across model sizes, and a smaller 350M model can effectively curate high-quality training data with hard samples for a larger 13B model, resulting in an equally or superior instruction-tuned model compared to training on the complete dataset. Utilizing open-sourced OPT and Llama-2 models up to 13B in size, two publicly available instruction-tuning training datasets and evaluated by both automatic metrics & humans, our paper introduces a novel approach to training data selection, showcasing a more efficient alternative.
Computation and Language
What problem does this paper attempt to address?
The paper asks how to reduce the cost of instruction-tuning large language models (LLMs) through more efficient data selection while maintaining or improving model performance. The authors propose a data selection method based on each sample's learning percentage (LP), which automatically identifies the difficult samples that are most valuable for training. With this method, even a smaller language model can select high-quality training data for a larger one, which reduces the amount of data needed for training, lowers cost, and yields performance comparable to or better than training on the complete dataset.

The key points of the paper are:

1. **Importance of data selection**: Conventional instruction-tuning trains extensively on large datasets, which leads to high computational costs. An effective method for selecting high-quality training data is therefore crucial for improving training efficiency and reducing cost.
2. **Learning percentage as a difficulty indicator**: The authors propose the learning percentage (LP) as an indicator of how difficult a training sample is. LP is defined as the proportion of a sample's perplexity reduction that occurs in the early stage of training. The lower the LP value, the more difficult the sample is to learn.
3. **Transfer of data difficulty across model sizes**: The study finds that samples identified as difficult by small models are also difficult for large models, so a small model can effectively select training data for a large one.
4. **Proposal of LPapp**: To make data selection more efficient, the authors also propose an approximate learning percentage (LPapp), which can be computed from a single training run and performs comparably to or better than LP.
Through these methods, the paper shows how to significantly reduce the computational resources required for training without sacrificing model performance, providing a more efficient and economical solution for the instruction-tuning of large-scale language models.
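
To make the learning percentage (LP) idea described above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes each sample's perplexity has already been measured before training, after an early checkpoint, and at the end of training, and it reads "proportion of perplexity reduction in the early stage" as the early drop divided by the total drop. The function names `learning_percentage` and `select_hardest`, and the `fraction` parameter, are hypothetical.

```python
from typing import List, Sequence


def learning_percentage(ppl_start: float, ppl_early: float, ppl_end: float) -> float:
    """Share of a sample's total perplexity drop achieved by the early checkpoint.

    Assumption: LP = (ppl_start - ppl_early) / (ppl_start - ppl_end).
    """
    total_drop = ppl_start - ppl_end
    if total_drop <= 0:
        # Assumption: a sample whose perplexity never decreased has no
        # well-defined LP under this reading, so we refuse to score it.
        raise ValueError("perplexity did not decrease over training")
    return (ppl_start - ppl_early) / total_drop


def select_hardest(lp_scores: Sequence[float], fraction: float) -> List[int]:
    """Return indices of the lowest-LP (hardest) samples, hardest first."""
    k = max(1, round(len(lp_scores) * fraction))
    ranked = sorted(range(len(lp_scores)), key=lambda i: lp_scores[i])
    return ranked[:k]


# Perplexity falls from 10 to 4 early on and ends at 2: 6 of the 8 points
# of total reduction happen early, so LP = 0.75 (an "easy" sample).
print(learning_percentage(10.0, 4.0, 2.0))          # 0.75

# Keep the hardest half of four samples, ranked by ascending LP.
print(select_hardest([0.75, 0.2, 0.5, 0.9], 0.5))   # [1, 2]
```

In the cross-size setting the summary describes, the LP scores would come from the smaller model's training run, and the selected indices would then define the training subset for the larger model.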