From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, Jing Xiao
2024-04-06
Abstract: In the realm of Large Language Models (LLMs), the balance between instruction data quality and quantity is a focal point. Recognizing this, we introduce a self-guided methodology for LLMs to autonomously discern and select cherry samples from open-source datasets, effectively minimizing manual curation and potential cost for instruction tuning an LLM. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability. Through the application of IFD, cherry samples can be pinpointed, leading to a marked uptick in model training efficiency. Empirical validations on datasets like Alpaca and WizardLM underpin our findings; with a mere $10\%$ of original data input, our strategy showcases improved results. This synthesis of self-guided cherry-picking and the IFD metric signifies a transformative leap in the instruction tuning of LLMs, promising both efficiency and resource-conscious advancements. Codes, data, and models are available: https://github.com/tianyi-lab/Cherry_LLM
Computation and Language
What problem does this paper attempt to address?
The paper addresses how to select high-quality instruction data efficiently, with minimal manual curation, when instruction-tuning large language models (LLMs). It proposes a self-guided approach in which the LLM itself picks out the most useful ("cherry") samples from open-source datasets using the Instruction-Following Difficulty (IFD) metric, which measures the discrepancy between the response a sample expects and what the model can generate on its own. The authors report that, on datasets such as Alpaca and WizardLM, training on only $10\%$ of the original data selected this way yields improved results while reducing training cost.
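The selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes IFD is computed as the ratio of the model's average loss on the answer given the instruction to its average loss on the answer alone (this summary does not state the formula), and it assumes per-token log-probabilities for each answer have already been obtained from the model. The function names `mean_nll`, `ifd_score`, and `select_top_fraction` are hypothetical.

```python
from typing import List, Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood over the answer tokens."""
    if not token_logprobs:
        raise ValueError("need at least one answer token")
    return -sum(token_logprobs) / len(token_logprobs)


def ifd_score(cond_logprobs: Sequence[float],
              direct_logprobs: Sequence[float]) -> float:
    """Assumed IFD: loss(answer | instruction) / loss(answer alone).

    cond_logprobs:   log-probs of the answer tokens with the instruction as context
    direct_logprobs: log-probs of the same answer tokens without the instruction
    A higher ratio means the instruction helps the model less, so the
    sample is treated as harder for the model to follow.
    """
    return mean_nll(cond_logprobs) / mean_nll(direct_logprobs)


def select_top_fraction(scores: Sequence[float],
                        fraction: float = 0.10) -> List[int]:
    """Return indices of the highest-scoring samples (at least one), in dataset order."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy log-probabilities (made up for illustration, not real model output).
    samples = [
        ([-0.5, -0.5], [-1.0, -1.0]),  # instruction helps a lot
        ([-0.9, -0.9], [-1.0, -1.0]),  # instruction helps little
    ]
    scores = [ifd_score(cond, direct) for cond, direct in samples]
    print(select_top_fraction(scores, fraction=0.5))
```

In practice the two sets of log-probabilities would come from two forward passes of the same model over each sample, one with and one without the instruction in the context. The default `fraction=0.10` mirrors the $10\%$ subset mentioned in the abstract.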