Abstract: The latest paradigm shift in software development brings the innovation and automation afforded by Large Language Models (LLMs), exemplified by the Generative Pre-trained Transformer (GPT), which has shown a remarkable capacity to generate code autonomously, significantly reducing the manual effort required for various programming tasks. The potential benefits of LLM-generated code are vast, most notably in efficiency and rapid prototyping. However, as LLMs become increasingly integrated into the software development lifecycle, and hence the supply chain, complex and multifaceted challenges arise, because the code generated by these language models raises profound questions about quality and correctness. Research is required to comprehensively explore these critical concerns surrounding LLM-generated code. In this paper, we propose a novel solution called metamorphic prompt testing to address these challenges. Our intuitive observation is that intrinsic consistency always exists among correct code pieces but may not exist among flawed code pieces, so we can detect flaws in the code by detecting inconsistencies. Therefore, we can paraphrase a given prompt into multiple prompts and ask the LLM to generate a version of the code for each, so that we can validate through cross-validation whether the semantic relations still hold among the acquired versions. Our evaluation on HumanEval shows that metamorphic prompt testing is able to detect 75 percent of the erroneous programs generated by GPT-4, with a false positive rate of 8.6 percent.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to verify the quality and correctness of the code generated by large language models (LLMs) in the absence of standard solutions or real-world outputs. Specifically, the paper proposes a new method named "metamorphic prompt testing" to detect errors in the code generated by LLMs. This method is based on the assumption that for prompts with the same meaning, LLMs should generate programs with the same semantics. By comparing the outputs of the code generated by different variant prompts, inconsistencies in the code can be detected, thereby uncovering potential errors.

### Main Contributions

1. **Proposed a new verification method**: Verify the programs generated by LLMs without any standard solutions or real-world outputs.
2. **Developed a new algorithm**: Report errors based on the conflicting outputs of multiple LLM-generated programs.
3. **Experimental evaluation**: Evaluated the effectiveness of this method on the well-known benchmark HumanEval.

### Method Overview

1. **Prompt variant generation**: Use natural language processing techniques to generate multiple variants of the original prompt.
2. **Target program generation**: Input the original prompt into the LLM to generate the target program.
3. **Variant program generation**: Input the prompt variants into the LLM to generate multiple variant programs.
4. **Automated test generation**: Use automated test generation tools to generate test inputs.
5. **Cross-validation**: Run the target program and variant programs and compare their outputs to detect inconsistencies.

### Experimental Results

- **Accuracy**: When using 5 prompt variants, the accuracy rate is 89.0%.
- **Recall**: When using 5 prompt variants, the recall rate is 75.0%.
- **Precision**: When using 5 prompt variants, the precision rate is 60.0%.
- **False positive rate**: When using 5 prompt variants, the false positive rate is 8.6%.
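To make the four reported rates concrete, the sketch below computes them from a confusion matrix. The counts are an assumption, not figures given in this summary: they are one split of the 164 HumanEval problems (24 erroneous programs, 140 correct ones) that happens to reproduce the reported rates after rounding. The names `Confusion` and `rates` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # erroneous programs that were flagged
    fn: int  # erroneous programs that were missed
    fp: int  # correct programs that were flagged
    tn: int  # correct programs that were not flagged


def rates(c: Confusion) -> dict:
    """Return the four rates quoted in the results list as fractions."""
    total = c.tp + c.fn + c.fp + c.tn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "recall": c.tp / (c.tp + c.fn),
        "precision": c.tp / (c.tp + c.fp),
        "false_positive_rate": c.fp / (c.fp + c.tn),
    }


# ASSUMED counts for illustration only; the summary reports rates, not counts.
assumed = Confusion(tp=18, fn=6, fp=12, tn=128)

for name, value in rates(assumed).items():
    print(f"{name}: {value:.1%}")
```

Under these assumed counts, precision is much lower than recall because correct programs far outnumber erroneous ones, so even a false positive rate below 9% produces a sizeable share of false alarms among the flagged programs.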
### Experimental Setup

- **Dataset**: Use the HumanEval dataset, which contains 164 human-written prompts and their corresponding canonical solutions.
- **LLM model**: Use the gpt-4-1106-preview model provided by OpenAI.

### Ablation Study

- **No prompt variants**: Only input the same prompt into the LLM multiple times. The results show that this method has a lower recall rate but also a lower false positive rate.
- **Conservative cross-validation**: Use a more stringent cross-validation strategy. The results show that this method has a higher recall rate but a lower precision rate.

### Qualitative Analysis

- **False positive cases**: Manually analyzed the false positive cases and found that some variant prompts in these cases may introduce noise, resulting in inconsistent outputs.
- **False negative cases**: Manually analyzed the false negative cases and found that the errors in these cases are more concealed and difficult to detect through simple output comparison.

Through these methods and experiments, the paper provides an effective means to verify the quality and correctness of the code generated by LLMs, providing important support for code generation in practical applications.
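The cross-validation step from the method overview can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the summary does not give the exact decision rule, so the sketch assumes the target program is flagged when more than half of the variant programs disagree with it on at least one test input. The programs here are hand-written stand-ins for LLM output, the test inputs are hand-picked rather than produced by a test generation tool, and the names `safe_run`, `disagrees`, `flag_target`, and `threshold` are hypothetical.

```python
from typing import Any, Callable, Sequence, Tuple

Program = Callable[..., Any]
Args = Tuple[Any, ...]


def safe_run(program: Program, args: Args) -> Tuple[str, Any]:
    """Run a program on one input; an exception counts as its own outcome."""
    try:
        return ("ok", program(*args))
    except Exception as exc:
        return ("error", type(exc).__name__)


def disagrees(target: Program, variant: Program, test_inputs: Sequence[Args]) -> bool:
    """True if the two programs produce different outcomes on any test input."""
    return any(safe_run(target, a) != safe_run(variant, a) for a in test_inputs)


def flag_target(
    target: Program,
    variants: Sequence[Program],
    test_inputs: Sequence[Args],
    threshold: float = 0.5,  # ASSUMED rule: flag when a majority of variants conflict
) -> bool:
    """Report the target as likely erroneous based on conflicting outputs."""
    if not variants:
        raise ValueError("at least one variant program is required")
    conflicts = sum(disagrees(target, v, test_inputs) for v in variants)
    return conflicts / len(variants) > threshold


# Stand-ins for programs an LLM might return for "sum of absolute values".
def target_buggy(xs):
    return sum(xs)  # bug: forgets abs()


def variant_a(xs):
    return sum(abs(x) for x in xs)


def variant_b(xs):
    total = 0
    for x in xs:
        total += -x if x < 0 else x
    return total


def variant_c(xs):
    return sum(map(abs, xs))


test_inputs = [([1, 2, 3],), ([-1, 2],), ([],)]

print(flag_target(target_buggy, [variant_a, variant_b, variant_c], test_inputs))
print(flag_target(variant_a, [variant_b, variant_c], test_inputs))
```

Raising `threshold` makes the check flag fewer programs, and lowering it makes the check stricter. Which setting corresponds to the paper's default or conservative strategy is not stated in this summary.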

Validating LLM-Generated Programs with Metamorphic Prompt Testing
