An Exploratory Study on Using Large Language Models for Mutation Testing

Bo Wang, Mingda Chen, Youfang Lin, Mike Papadakis, Jie M. Zhang
2024-09-14
Abstract: Mutation testing is a foundational approach in the software testing field, based on automatically seeded small syntactic changes, known as mutations. The question of how to generate high-utility mutations, to be used for testing purposes, forms a key challenge in the mutation testing literature. Large Language Models (LLMs) have shown great potential in code-related tasks, but their utility in mutation testing remains unexplored. To this end, we systematically investigate the performance of LLMs in generating effective mutations w.r.t. their usability, fault detection potential, and relationship with real bugs. In particular, we perform a large-scale empirical study involving six LLMs, including both state-of-the-art open- and closed-source models, and 851 real bugs on two Java benchmarks (i.e., 605 bugs from 12 projects of Defects4J 2.0 and 246 bugs of ConDefects). We find that, compared to existing approaches, LLMs generate more diverse mutations that are behaviorally closer to real bugs, which leads to approximately 19 percentage points higher fault detection than current approaches (i.e., 93% vs. 74%). Nevertheless, the mutants generated by LLMs have worse compilability, useless mutation, and equivalent mutation rates than those generated by rule-based approaches. This paper also examines alternative prompt engineering strategies and identifies the root causes of uncompilable mutations, providing insights for researchers to further enhance the performance of LLMs in mutation testing.
Software Engineering
What problem does this paper attempt to address?
This paper addresses the problem of generating effective mutants for mutation testing. Specifically, it explores whether large language models (LLMs) can generate mutants that are more diverse and behaviorally closer to real bugs than those produced by existing approaches, and thereby improve the fault detection rate. Traditional rule-based mutation generators can produce a large number of mutants, but many of them are redundant, useless, or uncompilable, which adds computational overhead and lowers efficiency. The paper therefore investigates whether LLMs can overcome these limitations and make mutation testing more effective and practical.

The main contributions of the paper include:

- **Evaluating the applicability of LLMs to mutation generation**: Extensive comparative experiments evaluate LLMs against existing tools and methods, and find that GPT-4 performs particularly well at generating mutants that are close to real bugs.
- **Comparing prompting strategies**: Few-shot learning combined with appropriate code context achieves the best performance.
- **Analyzing error types of uncompilable mutations**: Code involving member evaluation and method invocation is more likely to lead LLMs to generate uncompilable mutants.
- **Constructing a high-quality mutation dataset**: A comprehensively annotated Java mutation dataset is created, which can be used not only for mutation testing but also for other defect-injection applications, such as fault localization and fault prediction.

Overall, the paper explores the potential of LLMs in mutation testing and provides insights and data to support future research.
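To make the terms used above concrete (uncompilable, killed, and equivalent mutants), here is a minimal sketch in Python. It is not the paper's tooling or benchmark: the function `max_of`, the three hand-written mutants, and the tiny test suite are all hypothetical, and Python's compile step only catches syntax errors, whereas a Java compiler would also reject calls to unknown methods.

```python
from typing import Callable, Dict, Optional

# Hypothetical program under test (not taken from the paper's benchmarks).
ORIGINAL = "def max_of(a, b):\n    return a if a > b else b\n"

# Hypothetical candidate mutants, each a small syntactic change.
MUTANTS: Dict[str, str] = {
    # Flips the comparison; a reasonable test should detect this.
    "relational_flip": "def max_of(a, b):\n    return a if a < b else b\n",
    # Changes > to >=; returns an equal value when a == b.
    "boundary_change": "def max_of(a, b):\n    return a if a >= b else b\n",
    # Malformed code, standing in for an uncompilable mutant.
    "broken_syntax": "def max_of(a, b):\n    return a if a > else b\n",
}


def load(source: str) -> Optional[Callable]:
    """Compile the source and return max_of, or None if it does not compile."""
    namespace: dict = {}
    try:
        exec(compile(source, "<mutant>", "exec"), namespace)
    except SyntaxError:
        return None
    return namespace["max_of"]


def passes_tests(fn: Callable) -> bool:
    """Run a tiny, hypothetical test suite against one implementation."""
    cases = [((1, 2), 2), ((5, 3), 5), ((4, 4), 4)]
    try:
        return all(fn(*args) == expected for args, expected in cases)
    except Exception:
        return False


def classify(mutants: Dict[str, str]) -> Dict[str, str]:
    """Label each mutant as uncompilable, killed, or survived."""
    labels = {}
    for name, source in mutants.items():
        fn = load(source)
        if fn is None:
            labels[name] = "uncompilable"
        elif passes_tests(fn):
            labels[name] = "survived"  # the tests cannot tell it apart
        else:
            labels[name] = "killed"  # at least one test fails on the mutant
    return labels


def summarize(labels: Dict[str, str]) -> Dict[str, float]:
    """Compute the compilability rate and the score over compilable mutants."""
    compilable = [v for v in labels.values() if v != "uncompilable"]
    killed = sum(1 for v in compilable if v == "killed")
    return {
        "compilability_rate": len(compilable) / len(labels),
        "mutation_score": killed / len(compilable) if compilable else 0.0,
    }


if __name__ == "__main__":
    labels = classify(MUTANTS)
    print(labels)
    print(summarize(labels))
```

In this sketch the `boundary_change` mutant survives because it returns an equal value whenever `a == b`, so no test on integer inputs can kill it. Mutants of this kind inflate cost without improving the test suite, which is why the equivalent mutation rate is tracked alongside compilability.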