Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation

Wendkûuni C. Ouédraogo, Kader Kaboré, Haoye Tian, Yewei Song, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé
2024-09-19
Abstract: Unit testing, crucial for ensuring the reliability of code modules, such as classes and methods, is often overlooked by developers due to time constraints. Automated test generation techniques have emerged to address this, but they frequently lack readability and require significant developer intervention. Large Language Models (LLMs), such as GPT and Mistral, have shown promise in software engineering tasks, including test generation, but their overall effectiveness remains unclear. This study presents an extensive investigation of LLMs, evaluating the effectiveness of four models and five prompt engineering techniques for unit test generation. We analyze 216,300 tests generated by the selected advanced instruct-tuned LLMs for 690 Java classes collected from diverse datasets. Our evaluation considers correctness, understandability, coverage, and test smell detection in the generated tests, comparing them to a widely used automated testing tool, EvoSuite. While LLMs demonstrate potential, improvements in test quality, particularly in reducing common test smells, are necessary. This study highlights the strengths and limitations of LLM-generated tests compared to traditional methods, paving the way for further research on LLMs in test automation.
Software Engineering
What problem does this paper attempt to address?
This paper attempts to address the following problem: **evaluating how effective large language models (LLMs) are at generating unit test cases, and how different prompt engineering techniques affect their performance**. Specifically, the research aims to answer the following key questions:

1. **Syntactic Correctness and Compilability** (RQ1):
   - How do prompt engineering techniques affect the ability of LLMs to generate test suites that are syntactically correct and compilable?
   - With appropriate prompt engineering, can LLMs be guided to generate test code that complies with the rules of the programming language and compiles successfully?
2. **Readability and Maintainability** (RQ2):
   - Beyond syntactic correctness, how do the test suites generated by LLMs perform in terms of human readability and maintainability?
   - Is the generated test code clear enough for developers to understand the test logic, maintain the tests, and integrate them into an existing codebase?
3. **Code Coverage** (RQ3):
   - How do the test suites generated by LLMs compare with those generated by search-based software testing (SBST) techniques in terms of code coverage?
   - Can the test cases generated by LLMs cover the code under test more comprehensively, thereby improving error-detection capability?
4. **Test Smell Detection** (RQ4):
   - How does the prevalence of test smells differ between the test suites generated by LLMs and those generated by EvoSuite?
   - Can prompt engineering lead LLMs to generate test code with fewer test smells, thereby improving the quality and maintainability of the test code?

### Research Background

Unit testing plays a crucial role in software development because it verifies the correctness and reliability of individual functions or code units. However, writing unit tests by hand is tedious and time-consuming, so many developers skip this step. Automated test-generation techniques have emerged to address this problem, but the tests they produce often lack readability and require considerable human intervention. In recent years, large language models (such as GPT and Mistral) have shown great potential in software engineering tasks, including automatic test case generation. Nevertheless, how effective these models actually are at generating high-quality unit tests remains unclear. This paper therefore conducts a large-scale, independent, and comprehensive study of four popular large language models (GPT-3.5, GPT-4, Mistral 7B, and Mixtral 8x7B) and five prompt engineering techniques to evaluate their performance in generating unit test cases.

### Main Contributions

1. **First Systematic Evaluation**: This paper is the first study to systematically evaluate the impact of prompt engineering techniques on the performance of LLMs in generating unit test cases.
2. **Multi-dimensional Evaluation**: The research evaluates not only traditional metrics such as syntactic correctness and compilability but also code coverage and the prevalence of test smells.
3. **Comparison with Traditional Methods**: The test cases generated by LLMs are compared with those generated by the traditional SBST tool EvoSuite.
4. **Dataset Release**: To support future research, the authors release a dataset containing 216,300 test cases generated by four instruction-tuned LLMs, together with performance-evaluation scripts.

Through these efforts, the paper provides a foundation for evaluating and improving the use of LLMs in unit-test generation and for further exploration of their potential in software engineering.
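To make the test-smell question (RQ4) more concrete, here is a minimal sketch of how one well-known smell, Assertion Roulette (a test method with several assertions and no indication of which one failed), could be flagged heuristically. This is not the detection tooling used in the study. The `StackTest` sample, the regular expressions, and the function names are all hypothetical, and the check simplifies the smell to "more than one assertion call per `@Test` method", ignoring assertion messages as well as braces inside strings or comments.

```python
import re

# Matches JUnit-style assertion calls such as assertEquals(...) or assertTrue(...).
ASSERT_CALL = re.compile(r"\bassert[A-Z]\w*\s*\(")

# Matches "@Test" followed by a void method header, up to its opening brace.
TEST_METHOD = re.compile(r"@Test\b[^{]*?\bvoid\s+(\w+)\s*\([^)]*\)[^{]*\{")


def method_body(source: str, open_brace: int) -> str:
    """Return the text between the brace at open_brace and its matching close."""
    depth = 0
    for i in range(open_brace, len(source)):
        if source[i] == "{":
            depth += 1
        elif source[i] == "}":
            depth -= 1
            if depth == 0:
                return source[open_brace + 1 : i]
    return source[open_brace + 1 :]  # unbalanced input: take the rest


def count_assertions(source: str) -> dict[str, int]:
    """Map each @Test method name to the number of assertion calls in its body."""
    counts = {}
    for match in TEST_METHOD.finditer(source):
        body = method_body(source, match.end() - 1)
        counts[match.group(1)] = len(ASSERT_CALL.findall(body))
    return counts


def flag_multi_assert(source: str, threshold: int = 1) -> list[str]:
    """Return test methods with more than `threshold` assertions (simplified heuristic)."""
    return [name for name, n in count_assertions(source).items() if n > threshold]


# Hypothetical generated test class, used only for illustration.
SAMPLE = """
import org.junit.Test;
import static org.junit.Assert.*;

public class StackTest {
    @Test
    public void test0() {
        Stack s = new Stack();
        s.push(1);
        assertEquals(1, s.size());
        assertFalse(s.isEmpty());
        assertEquals(1, s.pop());
    }

    @Test
    public void popOnEmptyStackThrows() {
        Stack s = new Stack();
        assertThrows(RuntimeException.class, () -> s.pop());
    }
}
"""

if __name__ == "__main__":
    print(count_assertions(SAMPLE))
    print(flag_multi_assert(SAMPLE))
```

A heuristic like this only approximates the smell. A real detector would parse the Java source properly and would also check whether each assertion carries an explanatory message.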