Enhancing the Code Debugging Ability of LLMs via Communicative Agent Based Data Refinement

Weiqing Yang, Hanbin Wang, Zhenghao Liu, Xinze Li, Yukun Yan, Shuo Wang, Yu Gu, Minghe Yu, Zhiyuan Liu, Ge Yu
2024-08-09
Abstract: Debugging is a vital aspect of software development, yet the debugging capabilities of Large Language Models (LLMs) remain largely unexplored. This paper first introduces DEBUGEVAL, a comprehensive benchmark designed to evaluate the debugging capabilities of LLMs. DEBUGEVAL collects data from existing high-quality datasets and defines four tasks to evaluate debugging effectiveness: BUG Localization, BUG Identification, Code Review, and Code Repair. Additionally, to enhance the code debugging ability of LLMs, this paper proposes a CoMmunicative Agent BaSed DaTa REfinement FRamework (MASTER), which generates refined code debugging data for supervised finetuning. Specifically, MASTER employs the Code Quizzer to generate refined data according to the defined tasks of DEBUGEVAL. Then the Code Learner acts as a critic and retains the generated problems that it cannot solve. Finally, the Code Teacher provides a detailed Chain-of-Thought based solution to each retained problem. We collect the synthesized data and finetune the Code Learner on it to enhance its debugging ability, yielding the NeuDebugger model. Our experiments evaluate various LLMs and NeuDebugger in the zero-shot setting on DEBUGEVAL. Experimental results demonstrate that 7B-scale LLMs, even code-oriented ones, have weaker debugging capabilities. In contrast, larger models (over 70B) show convincing debugging ability. Our further analyses illustrate that MASTER is an effective method to enhance code debugging ability by synthesizing data for Supervised Fine-Tuning (SFT) of LLMs.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses two main issues:

1. **Evaluating the code debugging capabilities of large language models (LLMs)**:
   - Although LLMs perform well on tasks such as code generation and translation, their code debugging ability has not been fully explored or evaluated. To fill this gap, the authors design a comprehensive benchmark, **DEBUGEVAL**, to assess it.
   - **DEBUGEVAL** includes four tasks: BUG Localization, BUG Identification, Code Review, and Code Repair. Together, these tasks evaluate a model's ability to locate errors, classify them, and provide correct fixes.
2. **Enhancing the code debugging capabilities of large language models**:
   - To address the lack of diversity and insufficient complexity in existing code debugging benchmarks, the authors propose **MASTER**, a data refinement framework based on communicative agents.
   - **MASTER** uses three agents (Code Quizzer, Code Learner, and Code Teacher) to generate high-quality supervised fine-tuning data that improves the code debugging capabilities of LLMs.
   - Specifically, **Code Quizzer** generates diverse code debugging problems, **Code Learner** acts as a critic and retains only the problems it cannot solve, and **Code Teacher** provides detailed Chain-of-Thought based solutions to the retained problems. The collected data is then used to fine-tune the Code Learner, which yields the NeuDebugger model.

Through these two efforts, the paper provides both a comprehensive benchmark for evaluating the code debugging capabilities of LLMs and an effective method for improving them.
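The three-agent loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `DebugProblem`, `SFTExample`, and `refine_data` are hypothetical, the agents are passed in as plain callables (in practice each would wrap an LLM call), and generating one problem per task from each seed is an assumption. The exact-match check is a placeholder; a task such as Code Repair would more plausibly be judged by running tests.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# The four DEBUGEVAL task names, as listed in the abstract.
TASKS = ("BUG Localization", "BUG Identification", "Code Review", "Code Repair")


@dataclass
class DebugProblem:  # hypothetical container, not from the paper
    task: str
    prompt: str
    answer: str  # reference answer supplied by the quizzer


@dataclass
class SFTExample:  # hypothetical container, not from the paper
    prompt: str
    solution: str  # Chain-of-Thought style solution from the teacher


def refine_data(
    seeds: Iterable[str],
    quizzer: Callable[[str, str], DebugProblem],
    learner: Callable[[str], str],
    teacher: Callable[[DebugProblem], str],
    is_correct: Callable[[DebugProblem, str], bool],
) -> List[SFTExample]:
    """Keep only problems the learner fails, paired with teacher solutions."""
    examples: List[SFTExample] = []
    for seed in seeds:
        for task in TASKS:  # assumption: one generated problem per task per seed
            problem = quizzer(seed, task)
            attempt = learner(problem.prompt)
            if is_correct(problem, attempt):
                continue  # the learner already solves it, so it is dropped
            examples.append(SFTExample(problem.prompt, teacher(problem)))
    return examples


# Toy stand-ins for the three agents, for illustration only.
def toy_quizzer(seed: str, task: str) -> DebugProblem:
    return DebugProblem(task=task, prompt=f"[{task}] {seed}", answer="B")


def toy_learner(prompt: str) -> str:
    # Pretend the learner only answers Code Review problems correctly.
    return "B" if prompt.startswith("[Code Review]") else "A"


def toy_teacher(problem: DebugProblem) -> str:
    return f"Reasoning steps for {problem.task}; final answer: {problem.answer}"


def exact_match(problem: DebugProblem, attempt: str) -> bool:
    return attempt.strip() == problem.answer


data = refine_data(
    ["def add(a, b): return a - b"],
    toy_quizzer, toy_learner, toy_teacher, exact_match,
)
```

With these toy agents, `data` holds the three problems the learner got wrong, each paired with a teacher solution. In the pipeline the summary describes, the resulting examples would then serve as supervised fine-tuning data for the Code Learner itself.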