Retrieval-Augmented Test Generation: How Far Are We?

Jiho Shin, Reem Aleithan, Hadi Hemmati, Song Wang
2024-09-19
Abstract: Retrieval-Augmented Generation (RAG) has shown notable advancements in software engineering tasks. Despite its potential, RAG's application in unit test generation remains under-explored. To bridge this gap, we take the initiative to investigate the efficacy of RAG-based LLMs in test generation. As RAGs can leverage various knowledge sources to enhance their performance, we also explore the impact of different sources of RAGs' knowledge bases on unit test generation to provide insights into their practical benefits and limitations. Specifically, we examine RAG built upon three types of domain knowledge: 1) API documentation, 2) GitHub issues, and 3) StackOverflow Q&As. Each source offers essential knowledge for creating tests from different perspectives, i.e., API documentation provides official API usage guidelines, GitHub issues offer resolutions of issues related to the APIs from the library developers, and StackOverflow Q&As present community-driven solutions and best practices. For our experiment, we focus on five widely used and typical Python-based machine learning (ML) projects, i.e., TensorFlow, PyTorch, Scikit-learn, Google JAX, and XGBoost, which are used to build, train, and deploy complex neural networks efficiently. We conducted experiments using the top 10% most widely used APIs across these projects, involving a total of 188 APIs. We investigate the effectiveness of four state-of-the-art LLMs (open- and closed-source), i.e., GPT-3.5-Turbo, GPT-4o, Mistral MoE 8x22B, and Llama 3.1 405B. Additionally, we compare three prompting strategies in generating unit test cases for the experimental APIs, i.e., zero-shot, a Basic RAG, and an API-level RAG on the three external sources. Finally, we compare the cost of different sources of knowledge used for the RAG.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks how effective Retrieval-Augmented Generation (RAG) is for unit test generation, and how that effectiveness changes with the knowledge source the RAG draws on. Specifically, the authors focus on the following points:

1. **Application of RAG to unit test generation**: Although RAG has shown notable progress on software engineering tasks, its use for unit test generation remains under-explored. The authors aim to fill this gap by evaluating RAG-based LLMs on this task.
2. **Influence of different knowledge sources**: The authors compare three types of external knowledge (API documentation, GitHub issues, and StackOverflow Q&As) to show the practical benefits and limitations of each as a RAG knowledge base.
3. **Improving unit test coverage**: The authors examine whether RAG increases the line coverage of the generated unit tests, and with it the quality and effectiveness of those tests.
4. **Cost-benefit analysis**: Beyond the technical gains, the authors weigh the cost of different RAG configurations, in particular how the choice of knowledge source and prompting strategy affects the cost of generating test cases.
5. **Manual analysis**: To understand the practical effect of RAG on unit test generation in more depth, the authors manually analyze how the different strategies affect the software under test.

In summary, the paper evaluates the effectiveness and potential of RAG for unit test generation, examines how different knowledge sources influence RAG performance, and offers a reference point for future research on the topic.
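
The abstract names three prompting strategies (zero-shot, a Basic RAG, and an API-level RAG) without spelling out how they differ. The sketch below shows one plausible reading and is not the authors' implementation: it assumes Basic RAG retrieves from the whole knowledge base, while API-level RAG first restricts the pool to documents associated with the API under test. The `Doc` type, the `api` tag, the token-overlap scoring, and all function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    source: str  # "api_doc", "github_issue", or "stackoverflow"
    api: str     # API this document is associated with (hypothetical tagging)
    text: str


def _tokens(text):
    # Crude tokenizer: lowercase, treat dots and parentheses as separators.
    for ch in ".()":
        text = text.replace(ch, " ")
    return set(text.lower().split())


def retrieve(query, docs, k=2):
    # Rank documents by token overlap with the query (a stand-in for a
    # real embedding-based retriever) and keep the top k.
    q = _tokens(query)
    ranked = sorted(docs, key=lambda d: len(q & _tokens(d.text)), reverse=True)
    return ranked[:k]


def build_prompt(api, strategy, kb, k=2):
    task = f"Write unit tests for the API `{api}`."
    if strategy == "zero_shot":
        return task  # no retrieved context at all
    if strategy == "basic_rag":
        pool = list(kb)  # assumption: search the whole knowledge base
    elif strategy == "api_level_rag":
        pool = [d for d in kb if d.api == api]  # assumption: per-API pool
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    context = "\n".join(f"[{d.source}] {d.text}" for d in retrieve(api, pool, k))
    return f"{context}\n\n{task}" if context else task
```

Under these assumptions, the only difference between the two RAG variants is the candidate pool handed to the retriever, which makes it easy to swap in a single knowledge source (for example, only GitHub issues) and compare prompt length, and therefore cost, across configurations.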