TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark

Kush Jain, Gabriel Synnaeve, Baptiste Rozière
2024-10-01
Abstract: Code generation models can help improve many common software tasks ranging from code completion to defect prediction. Most of the existing benchmarks for code generation LLMs focus on code authoring or code completion. Surprisingly, there has been far less effort dedicated to benchmarking software testing, despite the strong correlation between well-tested software and effective bug detection. To address this gap, we create and release TestGenEval, a large-scale benchmark to measure test generation performance. Based on SWEBench, TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories. It covers initial test authoring, test suite completion, and code coverage improvements. Test authoring simulates the process of a developer writing a test suite from scratch, while test completion mimics the scenario where a developer aims to improve the coverage of an existing test suite. We evaluate several popular models, with sizes ranging from 7B to 405B parameters. Our detailed analysis highlights TestGenEval's contribution to a comprehensive evaluation of test generation performance. In particular, models struggle to generate high-coverage test suites, with the best model, GPT-4o, achieving an average coverage of only 35.2%. This is primarily because models struggle to reason about execution and make frequent assertion errors when addressing complex code paths.
Software Engineering
What problem does this paper attempt to address?
This paper addresses the lack of benchmarks for automated test generation and test completion in software testing. Specifically:

1. **Limitations of existing benchmarks**:
   - Most existing benchmarks for code generation LLMs focus on code authoring or code completion and do not evaluate software testing performance.
   - Existing benchmarks are limited in scale and scope, usually covering only simple, self-contained programs; large-scale test generation benchmarks for real-world use cases are lacking.
2. **Importance of software testing**:
   - A high-quality test suite is crucial for finding inconsistencies between a system's specification and its implementation. An ideal test suite executes all code paths (high coverage) and catches regressions in the code (high mutation score).
   - However, writing high-quality tests is time-consuming, so it is often partially or completely neglected.
3. **Need for automated test generation**:
   - Automated test generation has been studied extensively, but existing benchmarks do not fully measure test generation performance on large-scale projects, especially for complex code paths.
   - The test completion task (adding tests to an existing test suite to improve coverage) also lacks a corresponding benchmark.

To address these problems, the authors introduce **TestGenEval**, a large-scale benchmark for evaluating test generation and test completion. It is built on SWEBench and comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories. TestGenEval includes two main tasks:

- **Test generation**: Generate an entire test suite from scratch.
- **Test completion**: Add new tests to an existing test suite to improve coverage.

Through these tasks, TestGenEval enables a more comprehensive evaluation of models' test generation capabilities on real projects and exposes the difficulties current models have with complex code paths and with improving test coverage.
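
The two quality measures named in the summary, coverage and mutation score, can be illustrated with a toy example. The sketch below is not TestGenEval's evaluation harness: the function `classify`, the two hand-written mutants, the `(input, expected)` representation of a test suite, and all helper names are assumptions made only for illustration. It shows how a test completion step (extending an existing suite) can raise both line coverage and mutation score.

```python
import dis
import sys


def classify(n):
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"


def mutant_boundary(n):  # hypothetical mutant: `<` changed to `<=`
    if n <= 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"


def mutant_sign(n):  # hypothetical mutant: `<` changed to `>`
    if n > 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"


def body_lines(func):
    """Line offsets (relative to the def line) that hold bytecode in the body."""
    code = func.__code__
    first = code.co_firstlineno
    return {ln - first for _, ln in dis.findlinestarts(code)
            if ln is not None and ln > first}


def executed_lines(func, inputs):
    """Line offsets of func that run when func is called on each input."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace any other function
        if event == "line":
            hit.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        for value in inputs:
            func(value)
    finally:
        sys.settrace(previous)  # always restore the previous tracer
    return hit


def line_coverage(func, suite):
    """Fraction of body lines executed by the inputs of the suite."""
    body = body_lines(func)
    inputs = [value for value, _ in suite]
    return len(executed_lines(func, inputs) & body) / len(body)


def passes(impl, suite):
    """True if impl returns the expected value for every case in the suite."""
    return all(impl(value) == expected for value, expected in suite)


def mutation_score(mutants, suite):
    """Fraction of mutants killed, i.e. failing at least one case."""
    killed = sum(1 for mutant in mutants if not passes(mutant, suite))
    return killed / len(mutants)


# An existing suite, and the same suite after a "test completion" step.
EXISTING_SUITE = [(5, "positive")]
COMPLETED_SUITE = EXISTING_SUITE + [(-1, "negative"), (0, "zero")]
MUTANTS = [mutant_boundary, mutant_sign]

if __name__ == "__main__":
    for name, suite in [("existing", EXISTING_SUITE),
                        ("completed", COMPLETED_SUITE)]:
        print(name,
              "coverage:", line_coverage(classify, suite),
              "mutation score:", mutation_score(MUTANTS, suite))
```

In this toy setting the existing suite exercises only the "positive" path and cannot distinguish `mutant_boundary` from the original, while the completed suite reaches every line and kills both mutants. Real projects would typically rely on dedicated coverage and mutation testing tools rather than a hand-rolled tracer.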