RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance

Haolin Jin, Zechao Sun, Huaming Chen
2024-10-03
Abstract: Large Language Models (LLMs) have shown incredible potential in code generation tasks, and recent research in prompt engineering has enhanced LLMs' understanding of textual information. However, ensuring the accuracy of generated code often requires extensive testing and validation by programmers. While LLMs can typically generate code based on task descriptions, their accuracy remains limited, especially for complex tasks that require a deeper understanding of both the problem statement and the code generation process. This limitation is primarily due to the LLMs' need to simultaneously comprehend text and generate syntactically and semantically correct code, without having the capability to automatically refine the code. In real-world software development, programmers rarely produce flawless code in a single attempt based on the task description alone; they rely on iterative feedback and debugging to refine their programs. Inspired by this process, we introduce a novel architecture of LLM-based agents for code generation and automatic debugging: Refinement and Guidance Debugging (RGD). The RGD framework is a multi-LLM-based agent debugger that leverages three distinct LLM agents: the Guide Agent, the Debug Agent, and the Feedback Agent. RGD decomposes the code generation task into multiple steps, ensuring a clearer workflow and enabling iterative code refinement based on self-reflection and feedback. Experimental results demonstrate that RGD exhibits remarkable code generation capabilities, achieving a 9.8% improvement on the HumanEval dataset and a 16.2% improvement on the MBPP dataset compared to state-of-the-art approaches and traditional direct prompting approaches. We highlight the effectiveness of the RGD framework in enhancing LLMs' ability to generate and refine code autonomously.
Software Engineering, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the accuracy and reliability of large language models (LLMs) in code generation tasks. Although LLMs perform well at generating code from textual task descriptions, the generated code often requires extensive testing and verification by programmers. For complex tasks in particular, LLMs struggle to understand the task description and produce syntactically and semantically correct code at the same time. Existing multi-round code generation frameworks have improved code quality through iterative generation and debugging, but they still have practical limitations, such as failing to use failed test cases effectively for self-repair and depending too heavily on task descriptions. To this end, the paper proposes a new architecture, RGD (Refinement and Guidance Debugging), which improves the quality of code generation through the collaboration of multiple LLM agents. The RGD framework contains three specialized LLM agents: the Guide Agent, the Debug Agent, and the Feedback Agent. The Guide Agent produces guidance for code generation, the Debug Agent generates the initial code and debugs it, and the Feedback Agent analyzes failures and suggests corrections based on execution results. Through this phased workflow and iterative code refinement based on self-reflection and feedback, RGD aims to improve the capabilities of LLMs in code generation and automatic debugging, especially on complex tasks. The experimental results show that RGD achieves significant performance improvements on the benchmark datasets: a 9.8% improvement on HumanEval and a 16.2% improvement on MBPP over the compared state-of-the-art and direct prompting approaches.
This indicates that the RGD framework can effectively enhance the ability of LLMs to generate high-quality code and performs well when dealing with challenging programming tasks.
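The three-agent workflow summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`rgd_loop`, `run_tests`), the agent signatures, the `max_rounds` limit, and the stub agents are all assumptions made for illustration. Each agent is modeled as a plain callable standing in for an LLM call, and a deliberately buggy first draft shows how feedback from a failed test might drive one repair round.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures: each agent is a callable standing in for an LLM call.
GuideAgent = Callable[[str], str]  # task -> guidance
DebugAgent = Callable[[str, str, Optional[str], Optional[str]], str]  # task, guidance, previous code, feedback -> code
FeedbackAgent = Callable[[str, str, List[str]], str]  # task, code, failures -> suggestions


def run_tests(code: str, tests: List[str]) -> List[str]:
    """Run the candidate code, then each assertion; return failure descriptions.

    Note: exec() on model-generated code is unsafe outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
    except Exception as exc:
        return [f"code failed to load: {exc!r}"]
    failures = []
    for test in tests:
        try:
            exec(test, namespace)
        except Exception as exc:
            failures.append(f"{test} -> {exc!r}")
    return failures


def rgd_loop(
    task: str,
    tests: List[str],
    guide: GuideAgent,
    debug: DebugAgent,
    feedback: FeedbackAgent,
    max_rounds: int = 3,  # assumed iteration budget, not taken from the paper
) -> Tuple[str, bool]:
    """Guide once, then alternate test execution, feedback, and repair."""
    guidance = guide(task)
    code = debug(task, guidance, None, None)  # initial generation
    for _ in range(max_rounds):
        failures = run_tests(code, tests)
        if not failures:
            return code, True
        suggestions = feedback(task, code, failures)
        code = debug(task, guidance, code, suggestions)  # repair step
    return code, not run_tests(code, tests)


# Stub agents (purely illustrative) in place of real LLM calls.
def stub_guide(task: str) -> str:
    return "Return the sum of a and b."


def stub_debug(task, guidance, previous_code, suggestions) -> str:
    if suggestions is None:
        return "def add(a, b):\n    return a - b\n"  # deliberately buggy first draft
    return "def add(a, b):\n    return a + b\n"


def stub_feedback(task, code, failures) -> str:
    return "The result is too small; use + instead of -."


final_code, passed = rgd_loop(
    "Write add(a, b).", ["assert add(2, 3) == 5"],
    stub_guide, stub_debug, stub_feedback,
)
```

In this sketch the guidance is produced once and reused in every repair round, while the feedback is regenerated from the latest failures; whether the actual framework refreshes guidance between rounds is not stated in the summary above.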