AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

Zifan Song, Yudong Wang, Wenwei Zhang, Kuikun Liu, Chengqi Lyu, Demin Song, Qipeng Guo, Hang Yan, Dahua Lin, Kai Chen, Cairong Zhao
2024-05-30
Abstract: Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may insufficiently elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities, fine-tuned on multi-source data. To achieve this, we are the first to unveil inherent conflicts among the various styles and qualities in multi-source code corpora, and we introduce data-specific prompts with hindsight relabeling, termed AlchemistPrompts, to harmonize different data sources and instruction-response pairs. Additionally, we propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review. Extensive experiments demonstrate that AlchemistCoder holds a clear lead among all models of the same size (6.7B/7B) and rivals or even surpasses larger models (15B/33B/70B), showcasing the efficacy of our method in refining instruction-following capabilities and advancing the boundaries of code intelligence.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the problem that existing code generation large language models (Code LLMs), when fine-tuned on data from a single source, cannot fully realize the potential of their pre-trained base models because that data is limited in quality and diversity. Current methods rely mainly on specific types of code-related question-and-answer datasets, so the fine-tuning data lacks diversity, which in turn limits the model's performance, generalization ability, and robustness. To address these issues, the paper makes three main contributions:

1. **Multi-source Data Integration**: The paper overcomes the quality and diversity limits of a single data source by integrating data from multiple sources. Directly mixing multi-source data, however, may degrade performance, because data from different sources can conflict, for example in the programming language they require or in their response style.
2. **AlchemistPrompt**: To harmonize the inherent conflicts in multi-source data, the paper introduces data-specific prompts called AlchemistPrompts. These prompts are generated through hindsight relabeling and aim to bridge the style differences between data sources and to improve the alignment between instructions and responses.
3. **Code Comprehension Tasks**: In addition to traditional code generation tasks, the paper turns three steps of the data construction process into code comprehension tasks: instruction evolution, data filtering, and code review. These tasks improve the model's understanding of code and thereby further improve its performance.

With these methods, the paper reports that the AlchemistCoder series clearly outperforms other open-source code generation models of the same scale (6.7B/7B) on multiple benchmarks and in some cases rivals or even surpasses larger models (15B/33B/70B). This indicates that the proposed methods substantially improve the capabilities of code generation models.
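
The conflict between data sources and the idea of hindsight relabeling can be illustrated with a toy example. The sketch below is only an illustration under stated assumptions, not the paper's procedure: `Sample`, `describe_response`, `relabel`, and the source labels are hypothetical names, and the rule-based language check is a stand-in for whatever generator actually produces the AlchemistPrompts. Two sources answer the same instruction in different languages, and the relabeling step inspects each response after the fact and prepends the requirement it already satisfies, so the two pairs no longer contradict each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    source: str       # hypothetical label for the originating dataset
    instruction: str
    response: str


def describe_response(response: str) -> str:
    """Hypothetical stand-in for the prompt generator.

    Inspects the response in hindsight and states a requirement that the
    response already meets. This is a toy heuristic, not the paper's method.
    """
    if "#include" in response:
        language = "C++"
    elif "def " in response:
        language = "Python"
    else:
        language = "an unspecified language"
    return f"Answer in {language}."


def relabel(sample: Sample) -> Sample:
    """Prepend a data-specific prompt to the instruction; keep the response."""
    prompt = describe_response(sample.response)
    return Sample(sample.source, f"{prompt}\n{sample.instruction}", sample.response)


# Same instruction, conflicting responses from two (hypothetical) sources.
samples = [
    Sample("source_a", "Write a function that adds two numbers.",
           "def add(a, b):\n    return a + b"),
    Sample("source_b", "Write a function that adds two numbers.",
           "#include <iostream>\nint add(int a, int b) { return a + b; }"),
]

harmonized = [relabel(s) for s in samples]

if __name__ == "__main__":
    for s in harmonized:
        print(f"[{s.source}] {s.instruction!r}")
```

Before relabeling, the two pairs map one instruction to two incompatible targets. After relabeling, each instruction names the language its response uses, which is the kind of instruction-response alignment the summary attributes to AlchemistPrompts.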