Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee
2024-09-18
Abstract: Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes increasingly important. This work addresses the question: How can we determine the optimal subset of data for effective training? While existing research often emphasizes local criteria like instance quality for subset selection, we argue that a global approach focused on data diversity is more critical. Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, reassessing each cluster's importance and sampling weight in every training iteration. This approach reduces the effect of outliers and automatically filters out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is available at https://github.com/for-ai/iterative-data-selection.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to select the best subset of instruction data for fine-tuning large language models (LLMs). As instruction datasets multiply, choosing which examples to train on matters more and more. Existing work mostly relies on local criteria such as the quality of individual instances; the authors argue that a global criterion, data diversity, matters more. Their method uses k-means clustering so that the selected subset represents the full dataset. On top of this, they propose an iterative refinement procedure inspired by active learning: in every training iteration, instances are resampled from the clusters and each cluster's importance and sampling weight are reassessed. This reduces the effect of outliers and automatically filters out clusters that contain low-quality data. Evaluated on natural language reasoning, general world knowledge, code, and math reasoning tasks, with models from several families, the method yields consistent improvements: a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Overall, the paper argues for diversity-first sampling when fine-tuning LLMs to improve performance across a broad range of evaluation tasks.
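
To make the two ideas concrete, here is a minimal Python sketch of cluster-based sampling with iterative reweighting. It is an illustration under stated assumptions, not the authors' implementation (their code is in the linked repository). The function names (`kmeans`, `sample_by_weight`, `iterative_selection`) are hypothetical, the pure-Python k-means stands in for a library clustering routine over instruction embeddings, `score_fn` is a placeholder for whatever per-instance quality signal is computed after each training round, and the weight update (scaling each cluster's weight by its mean score) is an assumed rule chosen for simplicity.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def sample_by_weight(clusters, weights, budget, rng, taken):
    """Split roughly `budget` picks across clusters in proportion to their weights."""
    total = sum(weights.values())
    picked = []
    for c, members in clusters.items():
        quota = round(budget * weights[c] / total)
        pool = [i for i in members if i not in taken]  # never resample an index
        picked += rng.sample(pool, min(quota, len(pool)))
    return picked


def iterative_selection(points, k, budget, rounds, score_fn, seed=0):
    """Select indices over several rounds, reweighting clusters after each one."""
    rng = random.Random(seed)
    labels = kmeans(points, k, seed=seed)
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    weights = {c: 1.0 for c in clusters}  # start with equal cluster weights
    selected = set()
    for _ in range(rounds):
        batch = sample_by_weight(clusters, weights, budget // rounds, rng, selected)
        selected.update(batch)
        # A real pipeline would fine-tune on `selected` here before scoring.
        for c in clusters:
            scores = [score_fn(i) for i in batch if labels[i] == c]
            if scores:
                # Assumed update rule: scale the weight by the cluster's mean score.
                weights[c] = max(weights[c] * sum(scores) / len(scores), 1e-9)
    return sorted(selected), weights, labels
```

Calling `iterative_selection(points, k, budget, rounds, score_fn)` returns the selected indices, the final cluster weights, and the cluster label of every point. Clusters whose sampled instances score poorly see their weight shrink toward zero, so later rounds draw almost nothing from them, which mirrors the filtering behavior described above.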