Abstract: Regression testing plays a crucial role in software testing, as it ensures that new modifications do not disrupt the existing functionality and behaviour of the software system. Ideally, regression tests yield identical results when no modifications are made to the system under test. In practice, however, the presence of flaky tests introduces non-deterministic behaviour and undermines the reliability of regression testing results. In this paper, we propose an LLM-based approach for identifying the root cause of flaky tests in C++ projects at the code level, with the intention of assisting developers in debugging and resolving them more efficiently. We compile a comprehensive collection of C++ project flaky tests sourced from GitHub repositories. We fine-tune Mistral-7b, Llama2-7b and CodeLlama-7b models on the C++ dataset and an existing Java dataset and evaluate the performance in terms of precision, recall, accuracy, and F1 score. We assess the performance of the models across various datasets and offer recommendations for both research and industry applications. The results indicate that our models exhibit varying performance on the C++ dataset, while their performance is comparable to that on the Java dataset. Mistral-7b surpasses the other two models on all metrics, achieving a score of 1. Our results demonstrate the exceptional capability of LLMs to accurately classify flakiness in C++ and Java projects, providing a promising approach to enhance the efficiency of debugging flaky tests in practice.
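To make the notion of flakiness concrete, the sketch below reruns the same check several times with no code changes and flags it when the outcomes disagree. This only illustrates the phenomenon; it is not the paper's method, which classifies root causes with fine-tuned LLMs. The names `unstable_check`, `stable_check`, `rerun`, and `looks_flaky` are hypothetical, and the random draw is an assumed stand-in for a real source of non-determinism such as timing or concurrency.

```python
import random


def unstable_check(rng):
    """Stand-in for a flaky test: the outcome depends on a random draw,
    not on the code under test (hypothetical 50% pass rate)."""
    return rng.random() < 0.5


def stable_check(rng):
    """Stand-in for a deterministic test that always passes."""
    return True


def rerun(check, runs=20, seed=0):
    """Run the same check repeatedly without changing any code."""
    rng = random.Random(seed)  # seeded so the demo itself is reproducible
    return [check(rng) for _ in range(runs)]


def looks_flaky(outcomes):
    """A check is suspicious if identical reruns disagree with each other."""
    return len(set(outcomes)) > 1


if __name__ == "__main__":
    print("unstable:", looks_flaky(rerun(unstable_check)))
    print("stable:  ", looks_flaky(rerun(stable_check)))
```

Rerunning can reveal that a test is flaky, but it does not explain why, which is the gap the root-cause classification described in the abstract is meant to fill.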

What problem does this paper attempt to address?

The problem this paper addresses is flaky tests (unstable tests) in the regression testing of C++ projects. Specifically, the paper uses large language models (LLMs) to identify the root causes of test flakiness in C++ projects, helping developers debug and resolve such tests more efficiently.

### Problem Background

In software testing, regression testing ensures that new modifications do not break the functions and behaviors of the existing system. Ideally, regression test results stay the same as long as no modifications are made to the system. In practice, however, flaky tests pass or fail non-deterministically, independent of any code change. This non-deterministic behavior makes regression testing results unreliable.

### Paper Objectives

To address this problem, the paper proposes an LLM-based method for identifying the root causes of flaky tests in C++ projects at the code level. The specific objectives are:

1. **Constructing a Dataset**: Collect and organize a dataset of flaky tests from C++ projects in GitHub repositories.
2. **Model Fine-Tuning**: Fine-tune three LLMs, Mistral-7b, Llama2-7b, and CodeLlama-7b, on the C++ dataset and an existing Java dataset.
3. **Performance Evaluation**: Evaluate the models in terms of precision, recall, accuracy, and F1 score, and compare the C++ results with those on the existing Java dataset.
4. **Cross-Language Comparison**: Examine how well these models classify flaky tests in C++ and Java projects, and provide recommendations for future research and industrial applications.
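The four metrics named in the performance evaluation objective can be computed for a multi-class classifier as in the minimal sketch below. The function name `macro_metrics` and the category labels are hypothetical, and macro-averaging over classes is an assumption; the summary does not say which averaging scheme the paper uses.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num, den):
        return num / den if den else 0.0

    precisions = [safe_div(tp[c], tp[c] + fp[c]) for c in labels]
    recalls = [safe_div(tp[c], tp[c] + fn[c]) for c in labels]
    f1s = [safe_div(2 * p * r, p + r) for p, r in zip(precisions, recalls)]

    n = len(labels)
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision": safe_div(sum(precisions), n),
        "recall": safe_div(sum(recalls), n),
        "f1": safe_div(sum(f1s), n),
    }


if __name__ == "__main__":
    # Hypothetical root-cause categories, for illustration only.
    truth = ["concurrency", "concurrency", "time", "time"]
    preds = ["concurrency", "time", "time", "time"]
    print(macro_metrics(truth, preds))
```

A score of 1 on every metric, as reported for Mistral-7b on the C++ dataset, corresponds to the case where every prediction matches its true label.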
### Main Contributions

- **Dataset Release**: Provide the first publicly available dataset of C++ flaky tests, which is of great value for further research on flaky tests in C++.
- **Method Innovation**: Propose a flaky test classification method that is applicable to both Java and C++, two popular programming languages.
- **Model Comparison**: Compare the performance of multiple state-of-the-art LLMs on the flaky test classification task, providing guidance for future model selection.

### Research Questions

The paper mainly answers two research questions:

1. **RQ1**: How accurately does the method predict flaky test categories in C++ projects, and how does this compare with its performance on the existing Java dataset?
2. **RQ2**: How does the method perform on the flaky test classification task for Java projects compared to previous work?

### Results Summary

The experimental results show that the Mistral-7b model performs best on the C++ dataset, with all metrics reaching 1.0, while on the Java dataset the Llama2-7b model is slightly better, with an F1 score of 0.89. Overall, these LLMs, especially Mistral-7b and Llama2-7b, show strong capabilities in classifying flaky tests in C++ and Java projects.

### Summary

This paper addresses flaky tests in C++ projects by applying LLMs to root-cause classification. The approach aims to improve the reliability and efficiency of testing, and the released dataset and method provide support for future research.

A Large Language Model Approach to Identify Flakiness in C++ Projects
