Abstract: With contributions from the open-source community, a vast amount of instruction tuning (IT) data has emerged. Given the significant resource allocation required for training and evaluating models, it is advantageous to have an efficient method for selecting high-quality IT data. However, existing methods for instruction data selection have limitations such as relying on fragile external APIs, being affected by biases in GPT models, or reducing the diversity of the selected instruction dataset. In this paper, we propose an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method: Clustering and Ranking (CaR). CaR employs a two-step process: first, it ranks instruction pairs using a high-accuracy (84.25%) scoring model aligned with expert preferences; second, it preserves dataset diversity through clustering. In our experiment, CaR efficiently selected a mere 1.96% of Alpaca's IT data, yet the resulting AlpaCaR model surpassed Alpaca's performance by an average of 32.1% in GPT-4 evaluations. Moreover, we find that data selection is a consistent paradigm whether the pre-trained model is more capable or the model parameters are scaled up. Our approach employs compact models with 550M parameters and incurs just 11.2% of the financial outlay of current methods, enhancing its industrial deployability.

What problem does this paper attempt to address?

This paper addresses the question of how to efficiently select high-quality, diverse instruction data for instruction tuning (IT). Specifically:

1. **Limitations of existing methods**:
   - Existing instruction data selection methods rely on fragile external APIs, are affected by biases in GPT models, or reduce the diversity of the selected instruction dataset.
   - These methods are also costly to run, which makes them hard to deploy in industrial settings.
2. **Research objectives**:
   - Propose an industry-friendly instruction data selection method that is aligned with expert preferences and preserves diversity, in order to improve model performance and reduce training costs.

To this end, the authors propose a method named "Clustering and Ranking (CaR)", which works in two steps:

- **Quality assessment and ranking**: A high-accuracy (84.25%) scoring model, aligned with expert preferences, ranks the instruction pairs.
- **Maintaining data diversity**: Clustering ensures that the finally selected dataset remains diverse.

In the experiments, CaR selects only 1.96% of Alpaca's IT data, yet the resulting AlpaCaR model exceeds Alpaca's performance by an average of 32.1% in GPT-4 evaluations. The benefit of data selection holds both when a more capable pre-trained model is used and when the model parameters are scaled up. CaR is also cheaper than existing methods: it uses compact models with 550M parameters and incurs 11.2% of their monetary cost.

In summary, the paper tackles both the quality and the diversity of instruction-tuning data selection with CaR, improving model performance while reducing cost.
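The two steps above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that quality scores and cluster labels have already been computed by some scorer and some clustering algorithm, and it combines them in one plausible way (keep the globally best pairs, then keep the best pairs of every cluster). The function name `select_car_style` and the parameters `n_top` and `n_per_cluster` are hypothetical and do not come from the text above.

```python
from collections import defaultdict


def select_car_style(scores, cluster_ids, n_top=2, n_per_cluster=1):
    """Return the sorted indices of the selected instruction pairs.

    scores[i]:      quality score of pair i (higher is better), from any scorer.
    cluster_ids[i]: cluster label of pair i, from any clustering algorithm.
    n_top and n_per_cluster are illustrative parameters (assumptions).
    """
    if len(scores) != len(cluster_ids):
        raise ValueError("scores and cluster_ids must have the same length")

    # Rank all pairs by score, best first; ties are broken by original index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))

    # Ranking step: keep the globally best pairs.
    selected = set(ranked[:n_top])

    # Diversity step: keep the best pairs of every cluster,
    # so that no cluster disappears from the selected subset.
    kept_per_cluster = defaultdict(int)
    for i in ranked:
        cluster = cluster_ids[i]
        if kept_per_cluster[cluster] < n_per_cluster:
            selected.add(i)
            kept_per_cluster[cluster] += 1

    return sorted(selected)


# Toy data: six instruction pairs spread over three clusters.
scores = [0.9, 0.8, 0.2, 0.7, 0.1, 0.3]
cluster_ids = [0, 0, 1, 0, 2, 2]
print(select_car_style(scores, cluster_ids))  # [0, 1, 2, 5]
```

In the toy data, ranking alone would keep only pairs from cluster 0. The per-cluster pass adds the best pair from clusters 1 and 2, which is the kind of diversity preservation the summary describes.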

Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation

Data Diversity Matters for Robust Instruction Tuning

RECOST: External Knowledge Guided Data-efficient Instruction Tuning

Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Diversity Measurement and Subset Selection for Instruction Tuning Datasets

TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data

Improving Data Efficiency via Curating LLM-Driven Rating Systems

CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions

AlpaGasus: Training A Better Alpaca with Fewer Data

LESS: Selecting Influential Data for Targeted Instruction Tuning

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Instruction Mining: Instruction Data Selection for Tuning Large Language Models

Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning

G-DIG: Towards Gradient-based Diverse and High-quality Instruction Data Selection for Machine Translation

Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts

Optimizing Instruction Synthesis: Effective Exploration of Evolutionary Space with Tree Search

Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness

AlpaCare: Instruction-tuned Large Language Models for Medical Application

Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging

Instruction Matters: A Simple yet Effective Task Selection for Optimized Instruction Tuning of Specific Tasks

Curriculum Learning with Quality-Driven Data Selection