Abstract: Recently, there has been growing interest in how to construct better code instruction tuning data. However, we observe that code models trained with these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models finetuned from LLaMA3. Our experiments show that XCoder achieves new state-of-the-art performance using less training data, which verifies the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis of the data composition and find that existing code datasets have different characteristics according to their construction methods, which provides new insights for future code LLMs. Our models and dataset are released at https://github.com/banksy23/XCoder
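To make the notion of data leakage concrete, the following is a minimal sketch of one common way to flag training samples that overlap with benchmark items: word-level n-gram overlap. This is an illustrative assumption, not the detection method used in the paper; the function names, the n-gram size, and the threshold are all hypothetical.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(sample: str, benchmark_item: str, n: int = 3) -> float:
    """Fraction of the sample's n-grams that also occur in the benchmark item."""
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        return 0.0
    return len(sample_grams & ngrams(benchmark_item, n)) / len(sample_grams)


def decontaminate(
    train_samples: List[str],
    benchmark_items: List[str],
    n: int = 3,
    threshold: float = 0.5,  # hypothetical cut-off, chosen for illustration
) -> List[str]:
    """Keep only training samples whose overlap with every benchmark item
    stays below the threshold."""
    return [
        sample
        for sample in train_samples
        if all(overlap_ratio(sample, item, n) < threshold for item in benchmark_items)
    ]


benchmark = [
    "def has_close_elements(numbers, threshold): return True "
    "if any two numbers are closer than threshold"
]
train = [
    # Hypothetical leaked sample: identical to the benchmark item.
    benchmark[0],
    # Hypothetical clean sample.
    "write a function that reverses a linked list in place",
]
clean = decontaminate(train, benchmark)
```

Here `clean` retains only the second sample, because the first shares all of its trigrams with the benchmark item. Surface n-gram matching misses paraphrased leaks, so it should be read as a lower bound on contamination.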

What problem does this paper attempt to address?

This paper focuses on the quality and effectiveness of code instruction tuning datasets. Specifically:

1. **Identifying high-quality code instruction data**: Models trained on many current code instruction datasets score well on popular benchmarks such as HumanEval but poorly on other, uncontaminated benchmarks such as LiveCodeBench. This suggests that some existing datasets suffer from data leakage, which inflates model performance on specific benchmarks. The paper therefore aims to identify which datasets genuinely qualify as high-quality code instruction data.
2. **Proposing an effective data selection strategy**: To construct a high-quality code instruction dataset, the paper proposes an efficient data pruning strategy that selects good samples along three dimensions:
   - **Instruction Complexity**: Use an evolutionary complexity scorer to predict the complexity of a given instruction.
   - **Response Quality**: Measure the quality of a response by generating multiple test cases and evaluating the response's pass rate on them.
   - **Instruction Diversity**: Select samples that are far from the existing data pool to increase data diversity.
3. **Verifying the effectiveness of the data strategy**: Based on the filtered dataset, the paper presents the XCoder series of models and evaluates them on multiple benchmarks such as LiveCodeBench and HumanEval. The results show that XCoder matches or exceeds the performance of existing models with less training data, which supports the effectiveness of the data strategy.

In this way, the paper addresses the data leakage problems in current datasets and also provides new insights and directions for future research on code instruction tuning.
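The three selection dimensions described above can be sketched as a greedy filter. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Sample` fields, the product used to combine complexity and pass rate, the cosine-distance threshold, and the toy two-dimensional embeddings are all hypothetical stand-ins for the outputs of a real complexity scorer, test-case generator, and embedding model.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    instruction: str
    complexity: float           # assumed output of a complexity scorer
    pass_rate: float            # assumed fraction of generated tests passed, in [0, 1]
    embedding: Sequence[float]  # assumed instruction embedding


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select(samples: List[Sample], budget: int, min_distance: float = 0.1) -> List[Sample]:
    """Greedily pick up to `budget` samples.

    Samples are visited in descending order of complexity * pass_rate
    (an assumed way to combine the two scores). A sample is kept only if it
    is at least `min_distance` away from every sample already selected,
    which is the diversity criterion.
    """
    ranked = sorted(samples, key=lambda s: s.complexity * s.pass_rate, reverse=True)
    selected: List[Sample] = []
    for candidate in ranked:
        if len(selected) >= budget:
            break
        if all(
            cosine_distance(candidate.embedding, kept.embedding) >= min_distance
            for kept in selected
        ):
            selected.append(candidate)
    return selected


# Toy pool with hand-written scores and 2-D embeddings.
pool = [
    Sample("implement an LRU cache", 0.9, 1.0, (1.0, 0.0)),
    Sample("implement an LRU cache class", 0.8, 1.0, (1.0, 0.01)),  # near-duplicate
    Sample("parse a CSV file", 0.5, 0.9, (0.0, 1.0)),
    Sample("write a concurrent B-tree", 0.95, 0.1, (0.5, 0.5)),     # complex, low quality
]
chosen = select(pool, budget=2)
```

With this toy pool, `chosen` contains the first LRU cache sample and the CSV sample: the near-duplicate is rejected by the diversity check, and the complex but low-quality sample ranks last and is never reached within the budget.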

How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with Really Good Data

CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs

DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning

InstructCoder: Instruction Tuning Large Language Models for Code Editing

WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning

InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct

AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

Evaluating and Aligning CodeLLMs on Human Preference

CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

LLM-Assisted Code Cleaning For Training Accurate Code Generators

E-code: Mastering Efficient Code Generation through Pretrained Models and Expert Encoder Group

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Effi-Code: Unleashing Code Efficiency in Language Models

Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning