Code Comparison Tuning for Code Large Language Models

Yufan Jiang, Qiaozhi He, Xiaomin Zhuang, Zhihua Wu
2024-06-05
Abstract: We present Code Comparison Tuning (CCT), a simple and effective tuning method for code large language models (Code LLMs) to better handle subtle code errors. Specifically, we integrate the concept of comparison into instruction tuning, both at the token and sequence levels, enabling the model to discern even the slightest deviations in code. To compare the original code with an erroneous version containing manually added code errors, we use token-level preference loss for detailed token-level comparisons. Additionally, we combine code segments to create a new instruction tuning sample for sequence-level comparisons, enhancing the model's bug-fixing capability. Experimental results on the HumanEvalFix benchmark show that CCT surpasses instruction tuning in pass@1 scores by up to 4 points across diverse code LLMs, and extensive analysis demonstrates the effectiveness of our method.
Computation and Language
What problem does this paper attempt to address?
The paper addresses a weakness of large language models for code (Code LLMs): they struggle to fix code errors, especially subtle ones. It proposes **Code Comparison Tuning (CCT)**, which integrates code comparison into the instruction tuning process. CCT compares code at two levels. At the **token level**, a token-level preference loss contrasts the original code with an erroneous version containing manually added errors. At the **sequence level**, code segments are combined into new instruction tuning samples. Together, these make the model more sensitive to small differences in code and better at fixing bugs. On the HumanEvalFix benchmark, CCT raises pass@1 by up to 4 points over standard instruction tuning across multiple open-source Code LLMs. The paper also reports ablation studies that verify the method's effectiveness and examine how each component contributes to overall performance. Although CCT approaches or even surpasses closed-source models such as GPT-4 on certain tasks, the authors note that further research is needed to fully evaluate its performance.
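
The token-level comparison can be illustrated with a minimal Python sketch. This is not the authors' implementation: the regex tokenizer, the helper names (`tokenize`, `differing_positions`, `token_preference_loss`), and the log-sigmoid form of the loss are all assumptions chosen for illustration, since the summary does not specify the exact loss. The sketch aligns a correct snippet with a buggy variant, finds the token positions where they differ, and penalizes the model when it does not assign a higher log-probability to the correct token than to the erroneous one.

```python
import difflib
import math
import re


def tokenize(code):
    # Hypothetical tokenizer: words and single punctuation marks.
    # A real Code LLM would use its own subword vocabulary.
    return re.findall(r"\w+|[^\w\s]", code)


def differing_positions(correct, buggy):
    """Return (i, j) index pairs where a correct token was replaced by a buggy one."""
    pairs = []
    matcher = difflib.SequenceMatcher(a=correct, b=buggy, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only keep one-to-one replacements so tokens can be compared pairwise.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(range(i1, i2), range(j1, j2)))
    return pairs


def token_preference_loss(logp_correct, logp_buggy, pairs):
    """Assumed loss form: mean of -log(sigmoid(margin)) over differing tokens,
    where margin = log p(correct token) - log p(buggy token)."""
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        margin = logp_correct[i] - logp_buggy[j]
        total += math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    return total / len(pairs)


# Usage: a subtle operator bug, with made-up per-token log-probabilities.
correct_tokens = tokenize("return a + b")
buggy_tokens = tokenize("return a - b")
pairs = differing_positions(correct_tokens, buggy_tokens)  # only the operator differs

logp_correct = [-0.1, -0.2, -0.5, -0.1]  # illustrative values, not model output
logp_buggy = [-0.1, -0.2, -2.0, -0.1]
loss = token_preference_loss(logp_correct, logp_buggy, pairs)
```

Restricting the loss to the differing positions is the design choice that matters here: the two snippets are nearly identical, so a loss averaged over every token would be dominated by the tokens they share.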