Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee

2024-09-18

Abstract: Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes increasingly important. This work addresses the question: How can we determine the optimal subset of data for effective training? While existing research often emphasizes local criteria like instance quality for subset selection, we argue that a global approach focused on data diversity is more critical. Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, reassessing each cluster's importance and sampling weight in every training iteration. This approach reduces the effect of outliers and automatically filters out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is available at https://github.com/for-ai/iterative-data-selection.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper asks how to select the best subset of instruction data when fine-tuning large language models (LLMs). As instruction datasets multiply, choosing which data to train on matters more and more. Existing work mainly relies on local criteria such as instance quality; the authors argue that a global approach centered on data diversity is more important. The paper proposes a k-means clustering-based method so that the selected subset represents the full dataset, and adds an iterative refinement step inspired by active learning. In each training iteration, the method resamples instances from the clusters and reassesses each cluster's importance and sampling weight, which reduces the effect of outliers and automatically filters out clusters containing low-quality data. Evaluated on natural language reasoning, general world knowledge, code, and math reasoning tasks, with models from several families, the method yields consistent gains: a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Overall, the paper makes the case for diversity-first sampling when fine-tuning LLMs for performance across a broad range of evaluation tasks.
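The selection loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the farthest-first k-means initialisation, the multiplicative weight update, and `score_fn` (a hypothetical stand-in for whatever per-example quality signal a training iteration produces) are all assumptions made for the example.

```python
import random


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-first initialisation (assumed)."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(list(far))

    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def allocate(weights, budget):
    """Split a budget across clusters in proportion to their weights."""
    total = sum(weights)
    if total <= 0:
        return [0] * len(weights)
    raw = [budget * w / total for w in weights]
    counts = [int(r) for r in raw]
    by_remainder = sorted(range(len(weights)),
                          key=lambda c: raw[c] - counts[c], reverse=True)
    for c in by_remainder[: budget - sum(counts)]:
        counts[c] += 1
    return counts


def iterative_select(points, score_fn, k, budget, rounds=3, seed=0, eps=1e-6):
    """Sample from clusters over several rounds, reweighting clusters each round.

    score_fn(index) is a hypothetical quality signal in [0, 1]; the update
    rule below is an assumption, not the rule used in the paper.
    """
    rng = random.Random(seed)
    assign = kmeans(points, k)
    clusters = [[i for i, a in enumerate(assign) if a == c] for c in range(k)]
    weights = [1.0 if members else 0.0 for members in clusters]
    selected, chosen = [], set()
    for _ in range(rounds):
        counts = allocate(weights, budget // rounds)
        for c, n in enumerate(counts):
            pool = [i for i in clusters[c] if i not in chosen]
            picks = rng.sample(pool, min(n, len(pool)))
            selected.extend(picks)
            chosen.update(picks)
            if picks:
                # Clusters whose samples score poorly lose sampling weight.
                mean_score = sum(score_fn(i) for i in picks) / len(picks)
                weights[c] *= mean_score + eps
        total = sum(weights)
        weights = [w / total for w in weights]
    return selected, weights
```

Calling `iterative_select(embeddings, score_fn, k=..., budget=...)` returns the chosen example indices and the final cluster weights. Clusters that keep scoring poorly receive a shrinking share of each round's budget, which is one simple way to get the "automatic filtering" behaviour the summary describes. If a cluster runs out of unselected examples, this sketch returns fewer than `budget` indices rather than redistributing the shortfall.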

Diversity Measurement and Subset Selection for Instruction Tuning Datasets

Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning

Rethinking Data Selection at Scale: Random Selection is Almost All You Need

Data Selection for Task-Specific Model Finetuning

Reinforced Data Sampling for Model Diversification

G-DIG: Towards Gradient-based Diverse and High-quality Instruction Data Selection for Machine Translation

Data Diversity Matters for Robust Instruction Tuning

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

TSDS: Data Selection for Task-Specific Model Finetuning

Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness

On the Diversity of Synthetic Data and its Impact on Training Large Language Models

Harnessing Diversity for Important Data Selection in Pretraining Large Language Models

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Improving Data Efficiency via Curating LLM-Driven Rating Systems

Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization

Instruction Mining: Instruction Data Selection for Tuning Large Language Models

DELIA: Diversity-Enhanced Learning for Instruction Adaptation in Large Language Models

The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph

Rethinking Data Selection for Supervised Fine-Tuning