Abstract: The strong performance of large language models (LLMs) on natural language processing tasks has prompted extensive discussion of their application to code generation. Recent work proposes either multiple-sampling approaches to improve initial code generation accuracy or program repair approaches to refine the generated code. However, these methods suffer from LLMs' inefficiencies and limited reasoning capacity. In this work, we propose an LLM programming workflow (LPW) designed to improve both initial code generation and subsequent refinements within a structured two-phase workflow. Specifically, in the solution generation phase, the LLM first outlines a solution plan that decomposes the problem into manageable sub-problems and then verifies the generated plan against the visible test cases. Subsequently, in the code implementation phase, the LLM drafts initial code according to the solution plan and its verification. If the generated code fails the visible tests, the plan verification serves as the intended natural language solution that guides the refinement process in correcting bugs. We further introduce SLPW, a sampling variant of LPW, which first generates multiple solution plans and plan verifications, produces a program for each plan and its verification, and refines each program as necessary until one passes the visible tests. Compared to state-of-the-art methods across various existing LLMs, our experimental results show that LPW significantly improves Pass@1 accuracy by up to 16.4% on well-established text-to-code generation benchmarks, with a notable improvement of around 10% on challenging benchmarks. Additionally, SLPW demonstrates up to a 5.6% improvement over LPW and sets new state-of-the-art Pass@1 accuracy on various benchmarks, e.g., 98.2% on HumanEval, 84.8% on MBPP, 64.0% on APPS, and 35.3% on CodeContest, using GPT-4o as the backbone.
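The two-phase workflow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `llm` callable, the prompt prefixes (`PLAN`, `VERIFY`, `IMPLEMENT`, `REFINE`), the `solve` entry-point name, the refinement budget, and the helper `passes_visible_tests` are all hypothetical. The sampling variant here tries plans one after another, which simplifies the "generate multiple plans first" step described above.

```python
from typing import Callable, List, Optional, Tuple

LLM = Callable[[str], str]        # hypothetical interface: prompt in, text out
Test = Tuple[tuple, object]       # (positional arguments, expected result)


def passes_visible_tests(code: str, tests: List[Test], entry: str = "solve") -> bool:
    """Run candidate code against the visible tests (assumed entry point `solve`)."""
    namespace: dict = {}
    try:
        # Executing generated code is unsafe outside a sandbox; illustration only.
        exec(code, namespace)
        fn = namespace[entry]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False


def lpw(problem: str, tests: List[Test], llm: LLM, max_refinements: int = 3) -> Optional[str]:
    """Sketch of LPW: plan, verify the plan, implement, then refine on failure."""
    # Phase 1: solution generation.
    plan = llm(f"PLAN\n{problem}")
    verification = llm(f"VERIFY\n{problem}\n{plan}\n{tests}")
    # Phase 2: code implementation, guided by the plan and its verification.
    code = llm(f"IMPLEMENT\n{problem}\n{plan}\n{verification}")
    for _ in range(max_refinements):
        if passes_visible_tests(code, tests):
            return code
        # The plan verification acts as the intended natural language solution.
        code = llm(f"REFINE\n{problem}\n{verification}\n{code}")
    return code if passes_visible_tests(code, tests) else None


def slpw(problem: str, tests: List[Test], llm: LLM, num_plans: int = 3) -> Optional[str]:
    """Sketch of the sampling variant: try several plans until one program passes."""
    for _ in range(num_plans):
        code = lpw(problem, tests, llm)
        if code is not None:
            return code
    return None
```

In this sketch, `lpw` returns `None` when the refinement budget is exhausted without passing the visible tests, which is what lets `slpw` move on to a freshly sampled plan.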

Large Language Models to Generate System-Level Test Programs Targeting Non-functional Properties

Exploring and Characterizing Large Language Models For Embedded System Development and Debugging

Automated Control Logic Test Case Generation using Large Language Models

Large Language Models as Test Case Generators: Performance Evaluation and Enhancement

Evaluating Large Language Models for Automatic Register Transfer Logic Generation via High-Level Synthesis

LLM4DV: Using Large Language Models for Hardware Test Stimuli Generation

LLM4VV: Developing LLM-driven testsuite for compiler validation

Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

Evaluating LLMs for Hardware Design and Test

A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks

Benchmarking Large Language Models for Automated Verilog RTL Code Generation

Planning-Driven Programming: A Large Language Model Programming Workflow

Code Simulation Challenges for Large Language Models

On the Evaluation of Large Language Models in Unit Test Generation

VerilogEval: Evaluating Large Language Models for Verilog Code Generation

VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL Code Generation

C2HLSC: Leveraging Large Language Models to Bridge the Software-to-Hardware Design Gap

Software Testing with Large Language Models: Survey, Landscape, and Vision

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

Fully Autonomous Programming with Large Language Models