Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation

Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin Tao, Xiaofeng Zhao, Hongxia Ma, Li Zhang, Boxing Chen, Hao Yang, Bei Li, Tong Xiao, Jingbo Zhu
2024-10-12
Abstract: With contributions from the open-source community, a vast amount of instruction tuning (IT) data has emerged. Given the significant resource allocation required for training and evaluating models, it is advantageous to have an efficient method for selecting high-quality IT data. However, existing methods for instruction data selection have limitations such as relying on fragile external APIs, being affected by biases in GPT models, or reducing the diversity of the selected instruction dataset. In this paper, we propose an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method: Clustering and Ranking (CaR). CaR employs a two-step process: first, it ranks instruction pairs using a high-accuracy (84.25%) scoring model aligned with expert preferences; second, it preserves dataset diversity through clustering. In our experiment, CaR efficiently selected a mere 1.96% of Alpaca's IT data, yet the resulting AlpaCaR model surpassed Alpaca's performance by an average of 32.1% in GPT-4 evaluations. Moreover, we find that data selecting is a consistent paradigm whether the pre-trained model is more capable or the model parameters scaling up. Our approach employs compact models with 550M parameters and incurs just 11.2% of the financial outlay of current methods, enhancing its industrial deployability.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how to efficiently select high-quality, diverse instruction data for instruction tuning (IT).

1. **Limitations of existing methods**:
   - Existing instruction data selection methods rely on fragile external APIs, are affected by biases in GPT models, or reduce the diversity of the selected instruction dataset.
   - These limitations make the methods costly and inefficient, which is a particular obstacle in industrial applications.
2. **Research objectives**:
   - Propose an industry-friendly instruction data selection method that aligns with expert preferences and preserves diversity, in order to improve model performance and reduce training costs.

To this end, the authors propose a method named "Clustering and Ranking (CaR)", which works in two steps:

- **Quality assessment and ranking**: A scoring model aligned with expert preferences, with 84.25% accuracy, ranks the instruction pairs.
- **Maintaining data diversity**: Clustering ensures that the final selection remains diverse.

In the experiments, CaR selects only 1.96% of Alpaca's IT data, yet the resulting AlpaCaR model exceeds Alpaca's performance by an average of 32.1% in GPT-4 evaluations. The authors also report that the benefit of data selection holds both when the pre-trained model is more capable and when the parameter count scales up. CaR is also cheaper than existing methods in computational cost and time: per the abstract, it uses compact models with 550M parameters and incurs 11.2% of the financial outlay of current methods.

In summary, the paper tackles the quality and diversity problems in instruction-tuning data selection by proposing CaR, thereby improving model performance while reducing cost.
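
The two-step idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the text above does not say how the ranking and clustering results are combined, so the sketch assumes one plausible rule (keep the highest-scoring items overall, then add the best item or items from every cluster). The function name `car_select`, its parameters, and the toy scores and cluster labels are all hypothetical; in practice the scores would come from the quality scoring model and the cluster labels from a clustering step run beforehand.

```python
from collections import defaultdict


def car_select(scores, cluster_ids, top_overall, top_per_cluster):
    """Rank-then-cluster selection sketch (assumed combination rule).

    scores[i]      -- quality score of instruction pair i (assumed precomputed)
    cluster_ids[i] -- cluster label of instruction pair i (assumed precomputed)
    Returns the sorted indices of the selected instruction pairs.
    """
    if len(scores) != len(cluster_ids):
        raise ValueError("scores and cluster_ids must have the same length")

    # Step 1: rank all pairs by score, highest first (ties broken by index).
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    selected = set(ranked[:top_overall])

    # Step 2: walk the ranking and keep the best pairs of every cluster,
    # so that low-scoring clusters are still represented.
    per_cluster = defaultdict(list)
    for i in ranked:
        if len(per_cluster[cluster_ids[i]]) < top_per_cluster:
            per_cluster[cluster_ids[i]].append(i)
    for members in per_cluster.values():
        selected.update(members)

    return sorted(selected)


# Toy data: six instruction pairs spread over three clusters.
scores = [0.9, 0.8, 0.2, 0.7, 0.1, 0.3]
cluster_ids = [0, 0, 1, 0, 2, 2]
chosen = car_select(scores, cluster_ids, top_overall=2, top_per_cluster=1)
print(chosen)  # [0, 1, 2, 5]
```

In the toy run, ranking alone would keep only pairs from cluster 0; the per-cluster step adds pairs 2 and 5, so clusters 1 and 2 stay represented despite their lower scores.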