An Exploratory Study on Using Large Language Models for Mutation Testing

Bo Wang, Mingda Chen, Youfang Lin, Mike Papadakis, Jie M. Zhang

2024-09-14

Abstract: Mutation testing is a foundational approach in the software testing field, based on automatically seeded small syntactic changes, known as mutations. The question of how to generate high-utility mutations, to be used for testing purposes, forms a key challenge in the mutation testing literature. Large Language Models (LLMs) have shown great potential in code-related tasks, but their utility in mutation testing remains unexplored. To this end, we systematically investigate the performance of LLMs in generating effective mutations w.r.t. their usability, fault detection potential, and relationship with real bugs. In particular, we perform a large-scale empirical study involving six LLMs, including both state-of-the-art open- and closed-source models, and 851 real bugs on two Java benchmarks (i.e., 605 bugs from 12 projects of Defects4J 2.0 and 246 bugs of ConDefects). We find that, compared to existing approaches, LLMs generate more diverse mutations that are behaviorally closer to real bugs, which leads to approximately 19% higher fault detection than current approaches (i.e., 93% vs. 74%). Nevertheless, the mutants generated by LLMs have worse compilability rate, useless mutation rate, and equivalent mutation rate than those generated by rule-based approaches. This paper also examines alternative prompt engineering strategies and identifies the root causes of uncompilable mutations, providing insights for researchers to further enhance the performance of LLMs in mutation testing.

Software Engineering

What problem does this paper attempt to address?

This paper addresses the problem of generating effective mutants for software testing. Specifically, it explores the use of large language models (LLMs) in mutation testing, with the aim of generating mutants that are more effective, more diverse, and behaviorally closer to real bugs, and thereby improving the fault detection rate. Traditional approaches, such as rule-based mutation generation, can produce a large number of mutants, but many of these are redundant, useless, or uncompilable, which leads to high computational overhead and low efficiency. The paper therefore seeks to overcome the limitations of existing methods and to improve the effectiveness and practicality of mutation testing by using the capabilities of LLMs.

The main contributions of the paper include:

- **Evaluating the applicability of LLMs in mutation generation**: Through extensive comparative experiments, the paper evaluates the performance of LLMs relative to existing tools and methods, and finds that the GPT-4 model performs excellently in generating mutants close to real bugs.
- **Comparison of different prompting strategies**: The study finds that using few-shot learning and providing appropriate code context achieves the best performance.
- **Error-type analysis of non-compilable mutations**: The study determines that member evaluation and method invocation are more likely to cause LLMs to generate uncompilable mutants.
- **Constructing a high-quality mutation dataset**: The authors create a comprehensively annotated Java mutation dataset, which can be used not only for mutation testing but also for other defect-injection applications, such as fault localization and fault prediction.

Overall, the paper explores the potential of LLMs in mutation testing and provides insights and data to support future research.
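For readers new to the terminology, the sketch below illustrates what "mutant", "killed", and "equivalent mutant" mean in the summary above. It is a toy illustration only, not the paper's method: the function `max_of`, the three hand-written mutants, and the test inputs are all hypothetical, and Python is used for brevity even though the study targets Java. The score computed here is the classic mutation score of a test suite, which is a different quantity from the paper's fault detection metric on real bugs.

```python
from typing import Callable, Dict, List, Tuple

Program = Callable[[int, int], int]
TestCase = Tuple[Tuple[int, int], int]  # ((a, b), expected result)


def max_of(a: int, b: int) -> int:
    """Original program under test (hypothetical example)."""
    return a if a > b else b


# Hand-written mutants: each applies one small syntactic change to max_of.
MUTANTS: Dict[str, Program] = {
    # '>' replaced by '<': now returns the minimum.
    "relational_flip": lambda a, b: a if a < b else b,
    # '>' replaced by '>=': behaves identically to the original for all
    # inputs, so it is an equivalent mutant that no test can kill.
    "boundary_change": lambda a, b: a if a >= b else b,
    # Conditional removed: always returns the first argument.
    "return_first": lambda a, b: a,
}

# Test suite derived from the original program's expected behavior.
TESTS: List[TestCase] = [((1, 2), 2), ((5, 3), 5), ((4, 4), 4)]


def is_killed(mutant: Program, tests: List[TestCase]) -> bool:
    """A mutant is killed if at least one test fails or crashes on it."""
    for args, expected in tests:
        try:
            if mutant(*args) != expected:
                return True
        except Exception:
            return True
    return False


def mutation_score(mutants: Dict[str, Program], tests: List[TestCase]) -> float:
    """Fraction of mutants killed by the test suite."""
    killed = sum(is_killed(m, tests) for m in mutants.values())
    return killed / len(mutants)


if __name__ == "__main__":
    for name, mutant in MUTANTS.items():
        status = "killed" if is_killed(mutant, TESTS) else "survived"
        print(f"{name}: {status}")
    print(f"mutation score: {mutation_score(MUTANTS, TESTS):.2f}")
```

In this toy setup the `boundary_change` mutant survives every test because it cannot change observable behavior, which is why a high equivalent mutation rate, one of the weaknesses the abstract reports for LLM-generated mutants, adds cost without adding testing value.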
