How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

Yejie Wang, Keqing He, Dayuan Fu, Zhuoma Gongque, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, Jingang Wang, Mengdi Zhang, Xunliang Cai, Weiran Xu
2024-09-06
Abstract: Recently, there has been growing interest in studying how to construct better code instruction tuning data. However, we observe that code models trained with these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models finetuned from LLaMA3. Our experiments show that XCoder achieves new state-of-the-art performance using less training data, which verifies the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis of the data composition and find that existing code datasets have different characteristics according to their construction methods, which provides new insights for future code LLMs. Our models and dataset are released at https://github.com/banksy23/XCoder
Software Engineering, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper focuses on the quality and effectiveness of code instruction tuning datasets. Specifically:

1. **Identifying high-quality code instruction data**: Models trained on many current code instruction datasets score well on popular benchmarks such as HumanEval but poorly on uncontaminated benchmarks such as LiveCodeBench. This indicates that some existing datasets suffer from data leakage, which inflates model performance on specific benchmarks. The paper therefore aims to identify which datasets truly qualify as high-quality code instruction data.
2. **Proposing an effective data-screening strategy**: To construct a high-quality code instruction dataset, the paper proposes an efficient data-pruning strategy that selects good samples along three dimensions:
   - **Instruction Complexity**: Use an evolutionary complexity scorer to predict the complexity of a given instruction.
   - **Response Quality**: Measure the quality of a response by generating multiple test cases and evaluating the response's pass rate on them.
   - **Instruction Diversity**: Select samples that are far from the existing data pool to increase data diversity.
3. **Verifying the effectiveness of the data strategy**: Based on the filtered dataset, the paper introduces the XCoder series of models and evaluates them on multiple benchmarks, including LiveCodeBench and HumanEval. The results show that XCoder matches or exceeds the performance of existing models with less training data, which supports the effectiveness of the data strategy.

Through these methods, the paper addresses the data leakage problems in current datasets and provides new insights and directions for future code instruction tuning research.
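
The three-dimension selection described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` fields, the weighted sum used for ranking, the cosine-distance measure, and the `min_distance` threshold are all assumptions made for the example. In particular, the complexity score and test pass rate are taken as precomputed inputs in [0, 1] rather than produced by a real scorer or test generator.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    instruction: str
    complexity: float            # hypothetical complexity-scorer output, assumed in [0, 1]
    pass_rate: float             # hypothetical fraction of generated tests the response passes
    embedding: Sequence[float]   # hypothetical instruction embedding


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; treat zero vectors as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select_samples(
    pool: List[Sample],
    budget: int,
    min_distance: float = 0.1,   # assumed diversity threshold
    w_complexity: float = 0.5,   # assumed ranking weights
    w_quality: float = 0.5,
) -> List[Sample]:
    """Greedy selection: rank by complexity and quality, then keep a sample
    only if it is far enough from everything already selected."""
    ranked = sorted(
        pool,
        key=lambda s: w_complexity * s.complexity + w_quality * s.pass_rate,
        reverse=True,
    )
    selected: List[Sample] = []
    for candidate in ranked:
        if len(selected) >= budget:
            break
        is_diverse = all(
            cosine_distance(candidate.embedding, kept.embedding) >= min_distance
            for kept in selected
        )
        if is_diverse:
            selected.append(candidate)
    return selected


if __name__ == "__main__":
    pool = [
        Sample("a", 0.9, 1.0, (1.0, 0.0)),
        Sample("b", 0.8, 0.9, (0.99, 0.05)),  # near-duplicate of "a"
        Sample("c", 0.5, 0.8, (0.0, 1.0)),
        Sample("d", 0.2, 0.1, (1.0, 1.0)),
    ]
    print([s.instruction for s in select_samples(pool, budget=2)])
```

In this toy pool, sample "b" ranks second but is skipped because its embedding is almost identical to "a", so the lower-ranked but distinct sample "c" fills the second slot. The greedy pass compares each candidate against every kept sample, which is quadratic in the selected set; a large pool would likely need an approximate nearest-neighbor index instead.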