Sifting through the Chaff: On Utilizing Execution Feedback for Ranking the Generated Code Candidates

Zhihong Sun, Yao Wan, Jia Li, Hongyu Zhang, Zhi Jin, Ge Li, Chen Lyu
2024-09-19
Abstract: Large Language Models (LLMs), such as GPT-4, StarCoder, and CodeLlama, are transforming the way developers approach programming by automatically generating code based on given natural language descriptions. Despite advancements, generating syntactically and semantically correct code remains challenging, especially for complex programming tasks. Existing approaches typically generate multiple candidate solutions using LLMs to increase the likelihood of producing correct code. However, selecting the correct code from these candidates, a process known as code ranking, remains a major challenge. Current research on code ranking can be categorized into execution-based and non-execution-based methods. Execution-based methods, although effective, encounter notable limitations, such as scarcity of quality unit tests and security risks. Non-execution-based methods like CodeRanker, which rely solely on classification labels to train a code ranker, struggle to capture subtle errors and provide detailed error insights. Recognizing the strengths and limitations of both approaches, we propose a new method. The key insight of our work is that an effective code ranker is expected to truly comprehend the underlying causes of erroneous code, as relying solely on classification labels is insufficient. Inspired by this, this paper puts forward RankEF, an innovative approach for code ranking that leverages execution feedback. RankEF employs multi-task learning to integrate code classification with execution feedback generation. This approach enables the model to understand the reasons behind incorrect code, distinguishing between correct and incorrect solutions without the need to execute the code during the ranking phase. Experiments on three code generation benchmarks demonstrate that RankEF significantly outperforms the state-of-the-art CodeRanker.
Software Engineering
What problem does this paper attempt to address?
The paper addresses how to select correct code from the multiple candidates that large language models (LLMs) generate. Specifically, it focuses on code ranking: evaluating and ordering automatically generated candidates so that a syntactically and semantically correct one is selected.

### Problem Background

With the development of large language models (such as GPT-4, StarCoder, and Code Llama), code can be generated automatically from a natural language description or from incomplete code context. However, generating fully correct code (that is, code without syntax or semantic errors) is still a challenge, especially for complex programming tasks. Existing methods therefore usually generate multiple code candidates to increase the chance that at least one is correct. Selecting the correct code from these candidates (that is, code ranking) remains difficult.

### Limitations of Existing Methods

1. **Execution-based methods**: These methods run the code against unit tests to filter for correct candidates. Although effective, they suffer from the scarcity of high-quality unit tests and from security risks.
2. **Non-execution-based methods**: For example, CodeRanker. Such methods rely only on classification labels to train the code ranker, so they struggle to capture subtle errors or provide detailed error information, which limits their ability to distinguish correct from incorrect code.

### The Solution Proposed in the Paper

To address these problems, the paper proposes RankEF (Ranking with Execution Feedback), a new method that combines the advantages of execution-based and non-execution-based methods.
The key idea of RankEF is to use execution feedback to deepen the code ranker's understanding, so that it not only distinguishes correct from incorrect code but also understands the root cause of an error.

### How RankEF Works

1. **Multi-task learning framework**: RankEF adopts a multi-task learning framework that combines a code classification task with an execution feedback generation task. The model therefore uses execution feedback during training but does not need to execute code during inference.
2. **Overcoming the challenges of relying on execution feedback**: RankEF explores three multi-task learning strategies (hard parameter sharing, soft parameter sharing, and intermediate fine-tuning) to balance conflicts between the tasks and to ensure that the model makes full use of execution feedback during training.
3. **Consistent and clean execution feedback**: Raw execution feedback is noisy and inconsistent in format. RankEF classifies the execution results and uses templates to extract and integrate the relevant information, which keeps the training data consistent and clean.

### Experimental Results

Experiments show that RankEF significantly outperforms the existing state-of-the-art method CodeRanker on three code generation benchmarks (APPS, MBPP, and HumanEval), achieving relative improvements of +30.97%, +31.43%, and +19.51% on the Pass@1, Pass@2, and Pass@5 metrics respectively.

### Summary

The main contributions of this paper are:

1. Combining classification labels and execution feedback, for the first time, to rank generated code candidates.
2. Designing a unified multi-task learning framework and exploring three different learning strategies.
3. Conducting extensive experiments on three widely recognized code generation benchmarks to verify the effectiveness of the proposed method.

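
As a rough illustration of the multi-task objective described under "How RankEF Works", the sketch below adds a classification loss to a feedback-generation loss, the kind of joint objective a shared model could be trained on. It is only a minimal sketch: the weighted sum, the `gen_weight` parameter, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def multitask_loss(cls_logits, cls_label, gen_logits, feedback_tokens, gen_weight=1.0):
    """Classification loss plus a weighted mean token loss over the feedback text.

    cls_logits:      scores over execution-result classes for one candidate
    cls_label:       index of the true class
    gen_logits:      one list of vocabulary scores per feedback token position
    feedback_tokens: target token ids of the templated execution feedback
    gen_weight:      hypothetical weight balancing the two tasks
    """
    if not feedback_tokens or len(gen_logits) != len(feedback_tokens):
        raise ValueError("need one row of logits per feedback token")
    cls_loss = cross_entropy(cls_logits, cls_label)
    token_losses = [cross_entropy(row, tok) for row, tok in zip(gen_logits, feedback_tokens)]
    gen_loss = sum(token_losses) / len(token_losses)
    return cls_loss + gen_weight * gen_loss


# Toy example: uniform scores over 2 classes and a 4-token vocabulary.
loss = multitask_loss(
    cls_logits=[0.0, 0.0],
    cls_label=0,
    gen_logits=[[0.0] * 4, [0.0] * 4],
    feedback_tokens=[1, 3],
    gen_weight=0.5,
)
print(round(loss, 4))
```

The generation term is only needed while training; at ranking time a model trained this way would use just the classification scores, which is why no code has to be executed.
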
Through these improvements, RankEF can more accurately understand and distinguish between correct and incorrect code without executing any code at ranking time, thereby improving the quality and reliability of code generation.
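
To make the ranking step and the Pass@k numbers concrete, here is a minimal sketch of how a non-execution ranker could order candidates and how a ranked Pass@k could be computed. It assumes the ranker outputs one score vector per candidate and that index 0 means "correct"; that class layout, the candidate names, and the helper functions are hypothetical and are not the paper's implementation.

```python
import math


def p_correct(logits, correct_index=0):
    """Softmax probability assigned to the (assumed) 'correct' class."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[correct_index] / sum(exps)


def rank_candidates(candidates, scores):
    """Return candidates sorted by descending score; ties keep their original order."""
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order]


def ranked_pass_at_k(ranked_correctness, k):
    """Fraction of problems with at least one correct candidate among the top k.

    ranked_correctness: per problem, a list of booleans in ranked order.
    """
    hits = sum(1 for flags in ranked_correctness if any(flags[:k]))
    return hits / len(ranked_correctness)


# Toy example: three candidates for one problem, scored without running them.
candidates = ["cand_a", "cand_b", "cand_c"]
logits = [[0.2, 1.5], [2.0, -1.0], [0.0, 0.0]]
scores = [p_correct(row) for row in logits]
ranked = rank_candidates(candidates, scores)
print(ranked)

# Two problems: only the second-ranked candidate of the first problem is correct.
flags = [[False, True, False], [False, False, False]]
print(ranked_pass_at_k(flags, 1), ranked_pass_at_k(flags, 2))
```

Under this reading, a better ranker moves correct candidates toward the front of each list, which raises Pass@k for small k even though the pool of generated candidates is unchanged.
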