TESTEVAL: Benchmarking Large Language Models for Test Case Generation

Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, Lei Ma
2024-06-07
Abstract: Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program under test. Recently, researchers have recognized the potential of large language models (LLMs) in software testing. However, there remains a lack of fair comparisons between different LLMs in terms of test case generation capabilities. In this paper, we propose TESTEVAL, a novel benchmark for test case generation with LLMs. We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage. We further evaluate sixteen popular LLMs, including both commercial and open-source ones, on TESTEVAL. We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs, indicating a lack of ability to comprehend program logic and execution paths. We have open-sourced our dataset and benchmark pipelines at https://llm4softwaretesting.github.io to contribute to and accelerate future research on LLMs for software testing.
Software Engineering
What problem does this paper attempt to address?
The paper addresses the lack of a fair comparison between large language models (LLMs) on test case generation. It evaluates that capability through three tasks:

1. **Overall Coverage**: Assessing the ability of LLMs to generate test cases that cover most of the code lines or branches in a given program.
2. **Targeted Line/Branch Coverage**: Evaluating the ability of LLMs to generate a test case that covers a specific code line or branch.
3. **Targeted Path Coverage**: Assessing the ability of LLMs to generate a test case that follows a specific execution path.

To fill this gap, the authors propose TESTEVAL, a benchmark built from 210 Python programs collected from LeetCode, and use it to evaluate 16 LLMs, both commercial and open-source. The results show that although state-of-the-art LLMs can generate executable and diverse test cases to some extent, they still struggle to generate tests that cover a specified line, branch, or path, which indicates a limited ability to comprehend program logic and execution paths.
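To make the three tasks concrete, the following is a minimal sketch of how line-level versions of them could be checked for a single Python function. It is not the TESTEVAL pipeline, whose tooling is not described here. The function `classify` and the helpers `executed_lines`, `overall_coverage`, `covers_line`, and `follows_path` are hypothetical names introduced for illustration, and lines are identified by their offset from the `def` line.

```python
import dis
import sys


def classify(n):  # hypothetical program under test (offset 0)
    if n < 0:               # offset 1
        return "negative"   # offset 2
    if n == 0:              # offset 3
        return "zero"       # offset 4
    return "positive"       # offset 5


def executed_lines(func, *args):
    """Run func(*args) and return the ordered line offsets it executed."""
    code = func.__code__
    first = code.co_firstlineno
    hits = []

    def tracer(frame, event, arg):
        # Only follow frames that belong to the function under test.
        if frame.f_code is code:
            if event == "line":
                hits.append(frame.f_lineno - first)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)
    return hits


def body_lines(func):
    """Line offsets in the function body that have bytecode attached."""
    code = func.__code__
    first = code.co_firstlineno
    return {
        line - first
        for _, line in dis.findlinestarts(code)
        if line is not None and line > first
    }


def overall_coverage(func, tests):
    """Task 1 (sketch): fraction of body lines hit by a whole test suite."""
    covered = set()
    for args in tests:
        covered.update(executed_lines(func, *args))
    lines = body_lines(func)
    return len(covered & lines) / len(lines)


def covers_line(func, args, target):
    """Task 2 (sketch): does one test reach the target line?"""
    return target in executed_lines(func, *args)


def follows_path(func, args, path):
    """Task 3 (sketch): does one test execute exactly the given line sequence?"""
    return executed_lines(func, *args) == list(path)
```

Under these assumptions, a suite containing only the inputs `-1` and `0` leaves the final `return` unexecuted, so reaching it requires a test with a positive input. That is the kind of reasoning about program logic the targeted tasks demand of a model.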