Abstract: The strong performance of large language models (LLMs) on natural language processing tasks raises extensive discussion on their application to code generation. Recent work suggests multiple sampling approaches to improve initial code generation accuracy or program repair approaches to refine the code. However, these methods suffer from LLMs' inefficiencies and limited reasoning capacity. In this work, we propose an LLM programming workflow (LPW) designed to improve both initial code generation and subsequent refinements within a structured two-phase workflow. Specifically, in the solution generation phase, the LLM first outlines a solution plan that decomposes the problem into manageable sub-problems and then verifies the generated solution plan through visible test cases. Subsequently, in the code implementation phase, the LLM initially drafts a code according to the solution plan and its verification. If the generated code fails the visible tests, the plan verification serves as the intended natural language solution to inform the refinement process for correcting bugs. We further introduce SLPW, a sampling variant of LPW, which initially generates multiple solution plans and plan verifications, produces a program for each plan and its verification, and refines each program as necessary until one successfully passes the visible tests. Compared to the state-of-the-art methods across various existing LLMs, our experimental results show that LPW significantly improves the Pass@1 accuracy by up to 16.4% on well-established text-to-code generation benchmarks, especially with a notable improvement of around 10% on challenging benchmarks. Additionally, SLPW demonstrates up to a 5.6% improvement over LPW and sets new state-of-the-art Pass@1 accuracy on various benchmarks, e.g., 98.2% on HumanEval, 84.8% on MBPP, 64.0% on APPS, and 35.3% on CodeContest, using GPT-4o as the backbone.

What problem does this paper attempt to address?

The paper addresses the shortcomings of large language models (LLMs) in code generation, in particular their inefficiency and limited reasoning ability. Existing methods either improve initial code generation accuracy through multiple sampling or refine the code through program repair, but they have limitations: low sampling efficiency, conflict with human programming strategies, and feedback that lacks precise guidance for corrections. In multi-agent collaborative code generation, ineffective feedback mechanisms also reduce communication quality, and involving too many agents increases token consumption.

To address these problems, the authors propose LPW (Large Language Model Programming Workflow), which aims to improve both initial code generation and the subsequent refinement process. LPW is a structured two-stage workflow:

1. **Solution generation stage**: The LLM first outlines a solution plan that decomposes the problem into manageable sub-problems, then verifies the plan against the visible test cases.
2. **Code implementation stage**: The LLM drafts code according to the solution plan and its verification. If the code fails the visible tests, the plan verification serves as the intended natural language solution that guides the refinement process in correcting bugs.

The authors also introduce SLPW (Sampling LPW), a sampling variant of LPW. It generates multiple solution plans and plan verifications in the solution generation stage, produces a program for each plan and its verification, and refines each program as necessary until one passes the visible tests.
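The two-stage control flow described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Backend` class, its four callables, the retry limits, and the toy `sum_squares` problem are all hypothetical, and the LLM calls are replaced by fixed stub functions so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

TestCase = Tuple[tuple, object]  # (positional arguments, expected return value)


@dataclass
class Backend:
    """Hypothetical stand-ins for the LLM calls; not an API from the paper."""
    make_plan: Callable[[str], str]
    verify_plan: Callable[[str, List[TestCase]], Optional[str]]  # None = plan rejected
    write_code: Callable[[str, str], str]
    repair_code: Callable[[str, str, str], str]


def run_visible_tests(source: str, entry_point: str, tests: List[TestCase]) -> Optional[str]:
    """Execute a candidate program; return None on success, else a failure message."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # acceptable for a toy demo; sandbox untrusted code
        func = namespace[entry_point]
        for args, expected in tests:
            actual = func(*args)
            if actual != expected:
                return f"{entry_point}{args!r} returned {actual!r}, expected {expected!r}"
    except Exception as exc:
        return f"raised {type(exc).__name__}: {exc}"
    return None


def lpw(problem: str, entry_point: str, tests: List[TestCase], backend: Backend,
        max_plan_tries: int = 3, max_repairs: int = 3) -> Optional[str]:
    # Stage 1: outline a plan, then verify it against the visible tests.
    plan, verification = "", None
    for _ in range(max_plan_tries):
        plan = backend.make_plan(problem)
        verification = backend.verify_plan(plan, tests)
        if verification is not None:
            break
    if verification is None:
        return None
    # Stage 2: draft code, then repair it with the plan verification as guidance.
    code = backend.write_code(plan, verification)
    for attempt in range(max_repairs + 1):
        failure = run_visible_tests(code, entry_point, tests)
        if failure is None:
            return code
        if attempt < max_repairs:
            code = backend.repair_code(code, failure, verification)
    return None


def slpw(problem: str, entry_point: str, tests: List[TestCase], backend: Backend,
         num_plans: int = 3) -> Optional[str]:
    """Sampling variant, simplified here to trying independent plans one by one."""
    for _ in range(num_plans):
        code = lpw(problem, entry_point, tests, backend)
        if code is not None:
            return code
    return None


# Stub backend: the first draft is deliberately buggy, the repair fixes it.
demo = Backend(
    make_plan=lambda problem: "1. Square each element. 2. Sum the squares.",
    verify_plan=lambda plan, tests: "[1, 2, 3] -> 1, 4, 9 -> 14; [] -> 0",
    write_code=lambda plan, verification:
        "def sum_squares(xs):\n    return sum(x * 2 for x in xs)\n",
    repair_code=lambda code, failure, verification:
        "def sum_squares(xs):\n    return sum(x * x for x in xs)\n",
)
visible_tests = [(([1, 2, 3],), 14), (([],), 0)]
solution = lpw("Return the sum of squares of a list.", "sum_squares", visible_tests, demo)
print(solution)
```

In this sketch the plan verification is passed to both `write_code` and `repair_code`, mirroring the summary's point that the verified plan, not only the failing test output, guides the correction. With a real model behind `Backend`, each `slpw` iteration would sample a different plan; the stub returns the same one every time.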
The experimental results show that, compared with existing state-of-the-art methods across various LLMs, LPW improves Pass@1 accuracy by up to 16.4%, including gains of around 10% on challenging benchmarks. SLPW improves on LPW by up to a further 5.6% in Pass@1 accuracy and sets new state-of-the-art results on multiple benchmarks, e.g., 98.2% on HumanEval, 84.8% on MBPP, 64.0% on APPS, and 35.3% on CodeContest, using GPT-4o as the backbone.
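For readers unfamiliar with the metric, Pass@1 can be read as the fraction of benchmark problems for which the single submitted program is judged correct. The snippet below is a small sketch under that reading; the function name and the outcome list are hypothetical, and the numbers are illustrative rather than results from the paper.

```python
from typing import Sequence


def pass_at_1(outcomes: Sequence[bool]) -> float:
    """Fraction of problems whose one submitted program was judged correct."""
    if not outcomes:
        raise ValueError("need at least one problem")
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes for five problems (True = the submission was correct).
outcomes = [True, True, False, True, False]
score = pass_at_1(outcomes)
print(f"Pass@1 = {score:.1%}")
```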

Planning-Driven Programming: A Large Language Model Programming Workflow

Self-planning Code Generation with Large Language Models

Enabling Programming Thinking in Large Language Models Toward Code Generation

Multi-Programming Language Ensemble for Code Generation in Large Language Model

LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

Large Language Models as Code Executors: An Exploratory Study

Learning to Program with Natural Language

On the Effectiveness of Large Language Models in Domain-Specific Code Generation

Interactive and Expressive Code-Augmented Planning with Large Language Models

Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning

Improving Natural Language Capability of Code Large Language Model

Evaluating Large Language Models in Class-Level Code Generation

The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-based Code Generation

A Pair Programming Framework for Code Generation Via Multi-Plan Exploration and Feedback-Driven Refinement

A Survey on Evaluating Large Language Models in Code Generation Tasks

Steering Large Language Models between Code Execution and Textual Reasoning

Examination of Code generated by Large Language Models

Framework for evaluating code generation ability of large language models

Tree-Planner: Efficient Close-loop Task Planning with Large Language Models

Towards Large Language Model Aided Program Refinement