TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark

Kush Jain, Gabriel Synnaeve, Baptiste Rozière

2024-10-01

Abstract: Code generation models can help improve many common software tasks ranging from code completion to defect prediction. Most of the existing benchmarks for code generation LLMs focus on code authoring or code completion. Surprisingly, there has been far less effort dedicated to benchmarking software testing, despite the strong correlation between well-tested software and effective bug detection. To address this gap, we create and release TestGenEval, a large-scale benchmark to measure test generation performance. Based on SWEBench, TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories. It covers initial test authoring, test suite completion, and code coverage improvements. Test authoring simulates the process of a developer writing a test suite from scratch, while test completion mimics the scenario where a developer aims to improve the coverage of an existing test suite. We evaluate several popular models, with sizes ranging from 7B to 405B parameters. Our detailed analysis highlights TestGenEval's contribution to a comprehensive evaluation of test generation performance. In particular, models struggle to generate high-coverage test suites, with the best model, GPT-4o, achieving an average coverage of only 35.2%. This is primarily due to models struggling to reason about execution, and their frequent assertion errors when addressing complex code paths.

Software Engineering

What problem does this paper attempt to address?

This paper addresses the lack of benchmarks for automatic test generation and test completion in software testing. Specifically:

1. **Limitations of existing benchmarks**:
   - Most existing benchmarks for code generation LLMs focus on code authoring or code completion and ignore software testing performance.
   - Existing benchmarks are limited in scale and scope, usually covering only simple, self-contained programs. Large-scale test generation benchmarks for real-world use cases are lacking.
2. **Importance of software testing**:
   - A high-quality test suite is crucial for finding inconsistencies between a system's specification and its implementation. An ideal test suite executes all code paths (high coverage) and catches regressions in the code (high mutation score).
   - However, writing high-quality tests is time-consuming, so it is often partially or completely neglected.
3. **Need for automated test generation**:
   - Automated test generation has been studied extensively, but existing benchmarks do not fully measure test generation performance in large-scale projects, especially for complex code paths.
   - The test completion task (adding tests to an existing test suite to improve coverage) also lacks corresponding benchmarks.

To address these problems, the authors introduce **TestGenEval**, a large-scale benchmark for evaluating test generation and test completion. It is built on SWEBench and comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories. TestGenEval includes two main tasks:

- **Test generation**: Generate an entire test suite from scratch.
- **Test completion**: Add new tests to an existing test suite to improve coverage.

Through these tasks, TestGenEval evaluates the test generation capabilities of different models on real projects more comprehensively, and reveals the difficulties current models have with complex code paths and with improving test coverage.

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

Using Large Language Models to Generate JUnit Tests: An Empirical Study

ScenEval: A Benchmark for Scenario-Based Evaluation of Code Generation

The Fault in our Stars: Quality Assessment of Code Generation Benchmarks

CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models

CodeGen-Test: An Automatic Code Generation Model Integrating Program Test Information

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

JUGE: An Infrastructure for Benchmarking Java Unit Test Generators

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

Unit Test Generation using Generative AI : A Comparative Performance Analysis of Autogeneration Tools

DevEval: Evaluating Code Generation in Practical Software Projects

SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents

ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

CodeT: Code Generation with Generated Tests

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

Evaluating and Improving ChatGPT for Unit Test Generation

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

ReCode: Robustness Evaluation of Code Generation Models

Generating Unseen Code Tests In Infinitum

Unit Test Case Generation with Transformers and Focal Context