Rethinking the Influence of Source Code on Test Case Generation

Dong Huang, Jie M. Zhang, Mingzhe Du, Mark Harman, Heming Cui

2024-09-19

Abstract: Large language models (LLMs) have been widely applied to assist test generation with the source code under test provided as the context. This paper aims to answer the question: If the source code under test is incorrect, will LLMs be misguided when generating tests? The effectiveness of test cases is measured by their accuracy, coverage, and bug detection effectiveness. Our evaluation results with five open- and six closed-source LLMs on four datasets demonstrate that incorrect code can significantly mislead LLMs in generating correct, high-coverage, and bug-revealing tests. For instance, in the HumanEval dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions and correct code, but only 57.12% when given task descriptions and incorrect code. For the APPS dataset, prompts with correct code yield tests that detect 39.85% of the bugs, while prompts with incorrect code detect only 19.61%. These findings have important implications for the deployment of LLM-based testing: using it on mature code may help protect against future regression, but on early-stage immature code, it may simply bake in errors. Our findings also underscore the need for further research to improve LLMs' resilience against incorrect code in generating reliable and bug-revealing tests.

Software Engineering, Computation and Language

What problem does this paper attempt to address?

This paper asks whether large language models (LLMs) are misled when the source code under test is incorrect. The authors experimentally compare the effectiveness of LLM-generated test cases, measured by accuracy, coverage, and bug detection, when the prompt contains the task description plus incorrect code against prompts that contain only the task description or the task description plus correct code. The goal is to understand how the correctness of the source code affects an LLM's ability to generate test cases, and to offer guidance for using LLMs in automated test generation.

The main contributions of the paper include:

- For the first time, the influence of source code on test case generation is studied systematically.
- The experimental results show that providing the task description together with correct code significantly improves test case generation. For example, on the HumanEval dataset, the average accuracy of LLM-generated test cases is 80.45% when the task description and correct code are provided, but drops to 57.12% when the task description and incorrect code are provided.
- Based on these observations, the paper offers suggestions to developers and researchers who use LLMs for automatic test generation. In particular, LLM-based testing is more effective at protecting mature code against regressions, but when applied to relatively immature code in the early stages of development it may "solidify" (bake in) existing errors. The paper also calls for more research on making LLMs resilient to incorrect code when generating reliable, bug-revealing test cases.
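To make the two headline metrics concrete, here is a minimal Python sketch of one plausible reading of them. It is not the authors' evaluation harness: the text above does not give exact definitions, so this assumes test accuracy is the fraction of generated tests that pass on a correct reference implementation, and that a bug counts as detected when a test that is valid on the reference fails on the buggy code. The toy task, the functions `max_correct` and `max_buggy`, and both test lists are hypothetical.

```python
from typing import Callable, List, Tuple

# Each generated test is modelled as (input arguments, expected output).
TestCase = Tuple[tuple, object]


def passes(func: Callable, test: TestCase) -> bool:
    """Return True if func produces the expected output for this test."""
    args, expected = test
    try:
        return func(*args) == expected
    except Exception:
        # A crash counts as a failing test.
        return False


def accuracy_of_tests(tests: List[TestCase], reference: Callable) -> float:
    """Assumed metric: fraction of generated tests that pass on correct code."""
    if not tests:
        return 0.0
    return sum(passes(reference, t) for t in tests) / len(tests)


def detects_bug(tests: List[TestCase], reference: Callable, buggy: Callable) -> bool:
    """Assumed metric: some test valid on the reference fails on the buggy code."""
    return any(passes(reference, t) and not passes(buggy, t) for t in tests)


# Hypothetical task: return the largest element of a list.
def max_correct(xs):
    return max(xs)


def max_buggy(xs):
    # Deliberate bug for illustration: the last element is ignored.
    return max(xs[:-1])


# Tests an LLM might write when shown the correct code.
tests_from_correct = [(([1, 2, 3],), 3), (([5, 1],), 5), (([2, 9],), 9)]

# Tests an LLM might write when shown the buggy code:
# the expected values copy the buggy behaviour.
tests_from_buggy = [(([1, 2, 3],), 2), (([5, 1],), 5), (([2, 9],), 2)]

print(accuracy_of_tests(tests_from_correct, max_correct))
print(accuracy_of_tests(tests_from_buggy, max_correct))
print(detects_bug(tests_from_correct, max_correct, max_buggy))
print(detects_bug(tests_from_buggy, max_correct, max_buggy))
```

In this toy setup, the tests derived from the buggy code encode its wrong outputs as expectations, so most of them fail on the reference, and the one that remains valid does not exercise the bug. That is the "baking in" effect the summary describes, shown at a very small scale.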
