Abstract: Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may insufficiently elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data. To achieve this, we are the first to reveal the inherent conflicts among the various styles and qualities in multi-source code corpora, and we introduce data-specific prompts with hindsight relabeling, termed AlchemistPrompts, to harmonize different data sources and instruction-response pairs. Additionally, we propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review. Extensive experiments demonstrate that AlchemistCoder holds a clear lead among all models of the same size (6.7B/7B) and rivals or even surpasses larger models (15B/33B/70B), showcasing the efficacy of our method in refining instruction-following capabilities and advancing the boundaries of code intelligence.

What problem does this paper attempt to address?

This paper addresses the problem that existing code large language models (Code LLMs) are typically fine-tuned on a single data source, whose limited quality and diversity prevent them from fully realizing the potential of the pre-trained model. Specifically, current methods rely mainly on particular types of code-related question-and-answer datasets, so the fine-tuning data lacks diversity, which in turn limits the model's performance, generalization ability, and robustness. To address these issues, the paper makes the following key points:

1. **Multi-source Data Integration**: The paper integrates data from multiple sources to overcome the quality and diversity limits of any single source. However, directly mixing multi-source data may degrade performance, because data from different sources can conflict, for example in their programming language requirements and response styles.
2. **AlchemistPrompt**: To resolve the inherent conflicts in multi-source data, the paper introduces data-specific prompts called AlchemistPrompts. These prompts are generated through hindsight relabeling and aim to bridge the style differences between data sources and improve the alignment between instructions and responses.
3. **Code Understanding Tasks**: In addition to traditional code generation tasks, the paper designs three code understanding tasks drawn from the data construction process: instruction evolution, data filtering, and code review. These tasks improve the model's code understanding ability and thereby its overall performance.

With these methods, the paper reports that the AlchemistCoder series models clearly outperform other open-source code models of the same scale (6.7B/7B) on multiple benchmarks and, in some cases, rival or surpass larger models (15B/33B/70B). This indicates that the proposed methods are effective at improving the capabilities of code generation models.
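The idea behind point 2 can be illustrated with a minimal sketch. It assumes that hindsight relabeling means describing a response that already exists and prepending that description to its instruction, so the instruction matches the response it is paired with. The `Sample` type, the source names, and the rule-based `describe_response` helper are all hypothetical and stand in for whatever the paper actually uses to generate AlchemistPrompts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    source: str        # hypothetical name of the originating dataset
    instruction: str
    response: str


def describe_response(sample: Sample) -> str:
    """Hypothetical stand-in for hindsight relabeling.

    Looks at the existing response and writes a short prompt describing it.
    A real pipeline would likely use a stronger generator; this rule-based
    version is for illustration only.
    """
    language = "Python" if "def " in sample.response else "an unspecified language"
    style = "with explanatory comments" if "#" in sample.response else "as code only"
    return f"Answer in {language}, {style}."


def harmonize(sample: Sample) -> Sample:
    """Prepend the data-specific prompt to the instruction; keep the response."""
    prompt = describe_response(sample)
    return Sample(sample.source, f"{prompt}\n{sample.instruction}", sample.response)


# Two sources answer the same instruction in different styles.
samples = [
    Sample("source_a", "Reverse a string.", "def rev(s):\n    return s[::-1]"),
    Sample("source_b", "Reverse a string.",
           "# slice with step -1\ndef rev(s):\n    return s[::-1]"),
]
harmonized = [harmonize(s) for s in samples]
for h in harmonized:
    print(h.instruction.splitlines()[0])
```

After harmonization, the two samples no longer share an identical instruction with differently styled responses: each instruction now states the style its response follows, which is the kind of conflict the summary describes.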

AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data
