Abstract: Large Language Models (LLMs) obtain their instruction-following ability through instruction tuning. While the quality of instruction data is considered critical for a successful LLM, the selection of high-quality datasets for fine-tuning still lacks clear guidelines and quantitative analyses. In this work, we introduce three analytical views for instruction mining: diversity, complexity, and accuracy, which can aid in selecting an optimal subset of instruction data for fine-tuning. Based on these views, we propose a multi-view fusion framework for efficient instruction selection, comprising diversity sampling based on LoRA representation distribution, complexity scoring based on uncertainty quantification, and accuracy scoring based on reward modeling. We apply the framework to various open-source instruction datasets and achieve enhanced LLM performance with a carefully curated subset, underscoring the effectiveness of the proposed framework.

Multi-view Fusion for Instruction Mining of Large Language Model