Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs

Sylvain Kouemo Ngassom, Arghavan Moradi Dakhel, Florian Tambon, Foutse Khomh

2024-05-23

Abstract: LLM-based assistants, such as GitHub Copilot and ChatGPT, have the potential to generate code that fulfills a programming task described in a natural language description, referred to as a prompt. The widespread accessibility of these assistants enables users with diverse backgrounds to generate code and integrate it into software projects. However, studies show that code generated by LLMs is prone to bugs and may miss various corner cases in task specifications. Presenting such buggy code to users can undermine their reliance on and trust in LLM-based assistants. Moreover, users must spend significant effort to detect and repair any bug present in the code, especially if no test cases are available. In this study, we propose a self-refinement method aimed at improving the reliability of code generated by LLMs by minimizing the number of bugs before execution, without human intervention, and in the absence of test cases. Our approach is based on targeted Verification Questions (VQs) to identify potential bugs within the initial code. These VQs target various nodes within the Abstract Syntax Tree (AST) of the initial code, which have the potential to trigger specific types of bug patterns commonly found in LLM-generated code. Finally, our method attempts to repair these potential bugs by re-prompting the LLM with the targeted VQs and the initial code. Our evaluation, based on programming tasks in the CoderEval dataset, demonstrates that our proposed method outperforms state-of-the-art methods by decreasing the number of targeted errors in the code by 21% to 62% and improving the number of executable code instances to 13%.
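The targeting step described in the abstract (walking the AST of the initial code and picking out nodes that could trigger a known bug pattern) can be illustrated with Python's standard `ast` module. The following is a minimal sketch, not the authors' implementation: it assumes the generated code is Python, covers only two patterns, and uses deliberately simple, scope-insensitive heuristics of my own. Every attribute access is treated as a "wrong_attribute" candidate, and every name that is read but never bound in the snippet is treated as a "hallucinated_object" candidate. The function name `find_targeted_nodes` and the pattern labels are hypothetical. The output lists candidates for verification questions, not confirmed bugs.

```python
import ast
import builtins


def find_targeted_nodes(code: str) -> list[tuple[str, str]]:
    """Return (bug_pattern, detail) pairs for AST nodes a VQ could target.

    Hypothetical heuristics, for illustration only:
    - "wrong_attribute": every attribute access such as `obj.attr`
    - "hallucinated_object": every name that is read but never bound in the
      snippet (not a builtin, import, def, class, parameter, or assignment)
    """
    tree = ast.parse(code)

    # Pass 1: collect every name the snippet binds anywhere (ignores scopes).
    bound = set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            bound.add(node.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # `import a.b` binds `a`; `import a.b as c` binds `c`.
                bound.add(alias.asname or alias.name.split(".")[0])

    # Pass 2: flag candidate nodes, de-duplicated, in traversal order.
    targets: list[tuple[str, str]] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute):
            target = ("wrong_attribute", ast.unparse(node))
        elif (isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
              and node.id not in bound):
            target = ("hallucinated_object", node.id)
        else:
            continue
        if target not in targets:
            targets.append(target)
    return targets


if __name__ == "__main__":
    snippet = (
        "import os\n\n"
        "def list_py(folder):\n"
        "    return [f for f in os.listdir(folder) if f.endswith('.py')] + helper(folder)\n"
    )
    for pattern, detail in find_targeted_nodes(snippet):
        print(pattern, detail)
```

For the sample snippet, `os.listdir` and `f.endswith` are flagged as attribute candidates and the undefined `helper` as a possible hallucinated object, while `os`, `folder`, and `f` are not flagged because the snippet binds them. A real implementation would need scope-aware name resolution (for example, a name bound only inside one function is still unbound in another).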

Software Engineering, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses bugs in code generated by large language models (LLMs), which can undermine users' reliance on and trust in LLM-based assistants. Specifically, it studies how to improve the reliability of LLM-generated code through a self-refinement method that works without test cases. The method uses a chain of targeted Verification Questions (VQs) to identify and fix potential bugs in the initial code, thereby reducing specific types of errors and increasing the number of executable code instances. The paper focuses on two common bug patterns, "Wrong Attribute" and "Hallucinated Object", and achieves self-repair of the code by designing VQ templates for these patterns. The approach reduces these errors and improves the reliability of LLM-generated code without executing the code or requiring comprehensive test cases.
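The repair step (filling a VQ template for each targeted node and re-prompting the model with the questions and the initial code) can be sketched as below. This is an illustration under stated assumptions, not the paper's prompt: the template wording in `VQ_TEMPLATES`, the prompt layout, and the names `build_repair_prompt` and `self_refine` are all hypothetical. The model is abstracted as any callable `llm` that maps a prompt string to a completion string, so no particular model API is assumed. Consistent with the method, nothing is executed and no test cases are used.

```python
from typing import Callable

# Hypothetical VQ templates, one per targeted bug pattern. The paper designs
# its own templates; this wording is illustrative only.
VQ_TEMPLATES = {
    "wrong_attribute": (
        "Does `{detail}` refer to an attribute or method that really exists "
        "on that object?"
    ),
    "hallucinated_object": (
        "Is `{detail}` defined, imported, or a builtin before it is used?"
    ),
}


def build_repair_prompt(code: str, targets: list[tuple[str, str]]) -> str:
    """Combine the initial code with one numbered verification question per target."""
    questions = [
        f"{i}. {VQ_TEMPLATES[pattern].format(detail=detail)}"
        for i, (pattern, detail) in enumerate(targets, start=1)
    ]
    return (
        "Here is a Python function:\n\n"
        f"{code}\n\n"
        "Answer each verification question below, then return a corrected "
        "version of the function that fixes any problem you found.\n\n"
        + "\n".join(questions)
    )


def self_refine(code: str, targets: list[tuple[str, str]],
                llm: Callable[[str], str]) -> str:
    """Run one refinement round by re-prompting the model with the code and its VQs.

    Returns the raw completion; extracting the corrected code from it is
    left out of this sketch.
    """
    if not targets:
        return code  # nothing to verify, so keep the initial code
    return llm(build_repair_prompt(code, targets))
```

A stub such as `lambda prompt: prompt` can stand in for the model to inspect the generated prompt, for example `self_refine("def f(x):\n    return x.lenght", [("wrong_attribute", "x.lenght")], lambda p: p)`. Passing the model as a plain callable keeps the refinement loop independent of any specific LLM client.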
