Abstract: Unit testing, crucial for ensuring the reliability of code modules such as classes and methods, is often overlooked by developers due to time constraints. Automated test generation techniques have emerged to address this, but the tests they produce frequently lack readability and require significant developer intervention. Large Language Models (LLMs), such as GPT and Mistral, have shown promise in software engineering tasks, including test generation, but their overall effectiveness remains unclear. This study presents an extensive investigation of LLMs, evaluating the effectiveness of four models and five prompt engineering techniques for unit test generation. We analyze 216,300 tests generated by the selected advanced instruct-tuned LLMs for 690 Java classes collected from diverse datasets. Our evaluation considers correctness, understandability, coverage, and test smell detection in the generated tests, comparing them to a widely used automated testing tool, EvoSuite. While LLMs demonstrate potential, improvements in test quality, particularly in reducing common test smells, are necessary. This study highlights the strengths and limitations of LLM-generated tests compared to traditional methods, paving the way for further research on LLMs in test automation.

What problem does this paper attempt to address?

This paper addresses the following problem: **evaluating the effectiveness and potential of large language models (LLMs) in generating unit test cases, and the impact of different prompt engineering techniques on their performance**. Specifically, the research aims to answer the following key questions:

1. **Syntactic Correctness and Compilability** (RQ1):
   - How do prompt engineering techniques affect the ability of LLMs to generate test suites that are syntactically correct and compilable?
   - With appropriate prompt engineering, can LLMs be guided to generate test code that complies with the rules of the programming language and compiles successfully?
2. **Readability and Maintainability** (RQ2):
   - Beyond syntactic correctness, how do the test suites generated by LLMs perform in terms of human readability and maintainability?
   - Is the generated test code clear enough for developers to understand the test logic, maintain the test code, and integrate it into an existing codebase?
3. **Code Coverage** (RQ3):
   - How do the test suites generated by LLMs compare with those generated by search-based software testing (SBST) techniques in terms of code coverage?
   - Can the test cases generated by LLMs cover the code under test more comprehensively, thereby improving error-detection capabilities?
4. **Test Smells Detection** (RQ4):
   - How does the prevalence of test smells differ between the test suites generated by LLMs and those generated by EvoSuite?
   - Can prompt engineering lead LLMs to generate test code with fewer test smells, thereby improving the quality and maintainability of the test code?

### Research Background

Unit testing plays a crucial role in software development because it verifies the correctness and reliability of individual functions or code units. However, manually creating unit tests is both cumbersome and time-consuming, causing many developers to skip this important step. Automated test-generation techniques have emerged to solve this problem, but the tests they produce often lack readability and require a great deal of human intervention.

In recent years, large language models (such as GPT and Mistral) have shown great potential in software engineering tasks, including automatically generating test cases. Nevertheless, the actual effectiveness of these models in generating high-quality unit tests remains unclear. This paper therefore conducts a large-scale, independent, and comprehensive study of four popular large language models (GPT 3.5, GPT 4, Mistral 7B, and Mixtral 8x7B) and five prompt engineering techniques to evaluate their performance in generating unit test cases.

### Main Contributions

1. **First Systematic Evaluation**: This paper is the first study to systematically evaluate the impact of prompt engineering techniques on the performance of LLMs in generating unit test cases.
2. **Multi-dimensional Evaluation**: The research evaluates not only traditional metrics such as syntactic correctness and compilability but also code coverage and the prevalence of test smells.
3. **Comparison with Traditional Methods**: The test cases generated by LLMs are compared with those generated by the traditional SBST tool EvoSuite.
4. **Dataset Release**: To support future research, the authors release a dataset containing 216,300 test cases generated by four instruction-tuned LLMs, as well as performance-evaluation scripts.

Through these efforts, the paper provides a solid foundation for evaluating and improving the application of LLMs in unit-test generation, paving the way for further exploration of their potential in software engineering.
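
To make the test-smell criterion (RQ4) more concrete, the sketch below flags two smells that are well known in the general test-smell literature: Assertion Roulette (several assertions with no explanatory message) and Magic Number Test (numeric literals inside assertions). This is only an illustrative regex heuristic in Python. The summary does not state which detection tool or smell catalogue the authors used, and the `detect_smells` function, its output keys, and the sample JUnit test are all hypothetical.

```python
import re

# Hypothetical JUnit test, written for illustration only (not taken from the paper's dataset).
GENERATED_TEST = """
@Test
public void testAdd() {
    Calculator c = new Calculator();
    assertEquals(5, c.add(2, 3));
    assertEquals(0, c.add(-1, 1));
    assertTrue(c.add(1, 1) > 0);
}
"""

# A JUnit-style assertion call written on a single line; captures its name and argument text.
ASSERT_CALL = re.compile(r"\b(assert\w+)\s*\((.*)\)\s*;")
# A Java string literal, removed before looking for numbers so digits in messages do not count.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')
# A bare numeric literal that is not part of an identifier or a member access.
NUMERIC_LITERAL = re.compile(r"(?<![\w.])\d+(?:\.\d+)?\b")


def detect_smells(test_source: str) -> dict:
    """Heuristically flag two test smells, treating the input as a single test method."""
    assertions = ASSERT_CALL.findall(test_source)  # list of (name, args) tuples
    # Assumption: an assertion "has a message" if any string literal appears in its arguments.
    without_message = [args for _, args in assertions if '"' not in args]
    with_magic_number = [
        args for _, args in assertions
        if NUMERIC_LITERAL.search(STRING_LITERAL.sub("", args))
    ]
    return {
        "assertion_count": len(assertions),
        "assertion_roulette": len(assertions) > 1 and len(without_message) > 0,
        "magic_number_test": len(with_magic_number) > 0,
    }


print(detect_smells(GENERATED_TEST))
```

A real detector would parse the Java syntax tree and analyse each test method separately; the regex approach here is only meant to show what kind of pattern a test-smell count is based on.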

Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation
