Abstract: Software testing is a crucial phase in the software life cycle, helping to identify potential risks and reduce maintenance costs. With the advancement of Large Language Models (LLMs), researchers have proposed an increasing number of LLM-based software testing techniques, particularly in the area of test case generation. Despite the growing interest, little effort has been made to thoroughly evaluate the actual capabilities of LLMs in this task. In this paper, we introduce TestBench, a benchmark for class-level LLM-based test case generation. We construct a dataset of 108 Java programs from 9 real-world, large-scale projects on GitHub, each representing a different thematic domain. We then design three distinct types of prompts based on context descriptions: self-contained context, full context, and simple context. In addition, we propose a fine-grained evaluation framework that considers five aspects of test cases: syntactic correctness, compilation correctness, test correctness, code coverage rate, and defect detection rate. Furthermore, we propose a heuristic algorithm to repair erroneous test cases generated by LLMs. We evaluate CodeLlama-13b, GPT-3.5, and GPT-4 on TestBench, and our experimental results indicate that larger models demonstrate a greater ability to utilize contextual information effectively, thus generating higher-quality test cases. Smaller models may struggle with the noise introduced by the extensive information contained in the full context. However, when given the simple context, a simplified version derived from the full context via abstract syntax tree analysis, the performance of these models improves significantly. Our analysis highlights current progress and pinpoints future directions for further enhancing the effectiveness of models by handling contextual information for test case generation.
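
The five evaluation aspects named in the abstract can be pictured as a small aggregation over per-test outcomes. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the names `GeneratedTestOutcome` and `summarize` are hypothetical, the three correctness aspects are assumed to be rates over all generated tests, coverage is assumed to be averaged over passing tests, and defect detection is assumed to be the share of injected defects (for example, mutants) caught by passing tests.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GeneratedTestOutcome:
    """Hypothetical record for one LLM-generated test class."""
    parses: bool           # syntactic correctness
    compiles: bool         # compilation correctness
    passes: bool           # test correctness (runs without failure)
    line_coverage: float   # fraction of lines covered, in [0, 1]
    defects_detected: int  # injected defects caught by this test
    defects_total: int     # injected defects available to catch


def summarize(outcomes: List[GeneratedTestOutcome]) -> Dict[str, float]:
    """Aggregate the five aspects; the formulas are assumptions of this sketch."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    n = len(outcomes)
    # Only tests that parse, compile, and pass contribute coverage and detection.
    passing = [o for o in outcomes if o.parses and o.compiles and o.passes]
    total_defects = sum(o.defects_total for o in passing)
    return {
        "syntactic_correctness": sum(o.parses for o in outcomes) / n,
        "compilation_correctness": sum(o.compiles for o in outcomes) / n,
        "test_correctness": len(passing) / n,
        "coverage_rate": (
            sum(o.line_coverage for o in passing) / len(passing) if passing else 0.0
        ),
        "defect_detection_rate": (
            sum(o.defects_detected for o in passing) / total_defects
            if total_defects else 0.0
        ),
    }


if __name__ == "__main__":
    demo = [
        GeneratedTestOutcome(True, True, True, 0.8, 3, 4),
        GeneratedTestOutcome(True, True, False, 0.0, 0, 0),
        GeneratedTestOutcome(True, False, False, 0.0, 0, 0),
        GeneratedTestOutcome(False, False, False, 0.0, 0, 0),
    ]
    for name, value in summarize(demo).items():
        print(f"{name}: {value:.2f}")
```

Reporting the aspects separately, rather than as one score, makes it possible to see at which stage (parsing, compilation, execution) a model's generated tests tend to fail.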

CPP-UT-Bench: Can LLMs Write Complex Unit Tests in C++?

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

LLM4VV: Developing LLM-driven testsuite for compiler validation

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

Harnessing the Power of LLMs: Automating Unit Test Generation for High-Performance Computing

CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios

FullStack Bench: Evaluating LLMs as Full Stack Coders

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

Multi-language Unit Test Generation using LLMs

DebugBench: Evaluating Debugging Capability of Large Language Models

CFBench: A Comprehensive Constraints-Following Benchmark for LLMs

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

PyBench: Evaluating LLM Agent on various real-world coding tasks

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study