Rethinking the Influence of Source Code on Test Case Generation

Dong Huang, Jie M. Zhang, Mingzhe Du, Mark Harman, Heming Cui
2024-09-19
Abstract: Large language models (LLMs) have been widely applied to assist test generation with the source code under test provided as the context. This paper aims to answer the question: If the source code under test is incorrect, will LLMs be misguided when generating tests? The effectiveness of test cases is measured by their accuracy, coverage, and bug detection effectiveness. Our evaluation results with five open- and six closed-source LLMs on four datasets demonstrate that incorrect code can significantly mislead LLMs in generating correct, high-coverage, and bug-revealing tests. For instance, in the HumanEval dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions and correct code, but only 57.12% when given task descriptions and incorrect code. For the APPS dataset, prompts with correct code yield tests that detect 39.85% of the bugs, while prompts with incorrect code detect only 19.61%. These findings have important implications for the deployment of LLM-based testing: using it on mature code may help protect against future regression, but on early-stage immature code, it may simply bake in errors. Our findings also underscore the need for further research to improve LLMs' resilience against incorrect code in generating reliable and bug-revealing tests.
Software Engineering, Computation and Language
What problem does this paper attempt to address?
This paper asks whether large language models (LLMs) are misled when the source code under test is incorrect. Specifically, the authors experimentally evaluate how the effectiveness of LLM-generated test cases (accuracy, coverage, and bug detection) changes when the prompt contains the task description and incorrect code, compared with prompts that contain only the task description or that contain the correct code. The goal is to understand how the correctness of the source code affects an LLM's ability to generate test cases, and to offer guidance for using LLMs in automated test generation.
The main contributions of the paper include:
- The first systematic study of the influence of source code on test case generation.
- Experimental results showing that providing the task description together with correct code significantly improves test case generation. For example, on the HumanEval dataset, tests generated from the task description and correct code reach an average accuracy of 80.45%, while tests generated from the task description and incorrect code reach only 57.12%.
- Suggestions, based on these observations, for developers and researchers who use LLMs for automatic test generation. In particular, LLM-based testing is more effective at protecting mature code against regressions, but when applied to relatively immature code in the early stages of development it may "solidify" errors. The paper also calls for more research on making LLMs resilient to incorrect code when generating reliable, bug-revealing test cases.