Abstract: Test smells are poor programming and design practices in test code, and they are widespread across software projects. Because test smells hinder the comprehension and maintenance of test code and can even make the code under test more defect-prone, mining, detecting, and refactoring them is of great importance. Since Deursen et al. introduced the definition of “test smell”, several studies have worked on discovering new test smells from test specifications and software practitioners’ experience. Indeed, many bad testing practices are “observed” by software developers while creating test scripts rather than through academic research, and they are widely discussed in the software engineering community (e.g., on Stack Overflow) [70, 94]. However, no prior study has explored new bad testing practices from practitioners’ discussions, formally defined them as new test smell types, and analyzed their characteristics. This gap makes it harder for developers to learn about these bad practices and to avoid them during test code development. We therefore take up this challenge and develop systematic methods for exploring new test smell types on one of the most mainstream developer Q&A platforms, Stack Overflow. We further investigate the harmfulness of the new test smells and analyze possible solutions for eliminating them. We find that some test smells make it hard for developers to fix failed test cases and to trace the reasons for their failure. Worse, we identify two types of test smells that threaten the accuracy of test cases. Next, we develop a detector to detect test smells in software. The detector is composed of six detection methods for the different smell types. Each detection method is built around a set of syntactic rules based on the code patterns extracted from the different test smells and on developers’ code styles.
We manually construct a test smell dataset from seven popular Java projects and evaluate the effectiveness of our detector on it. The experimental results show that our detector achieves high precision, recall, and F1 score. We then apply the detector to 919 real-world Java projects to explore whether the six test smells are prevalent in practice. We observe these test smells in 722 of the 919 projects, which demonstrates that they are prevalent in real-world projects. Finally, to validate the practicality of the new test smells, we submit 56 issue reports to 53 real-world projects containing different smells. Based on a sentiment analysis of developers’ replies, our issue reports achieve an acceptance rate of 76.4%. These evaluations confirm the effectiveness of our detector and the prevalence and practicality of the new test smell types in real-world projects.

The Lost World: Characterizing and Detecting Undiscovered Test Smells.

Test smells in LLM-Generated Unit Tests

How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study

Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness

On the Evaluation of Large Language Models in Unit Test Generation

Are We Testing or Being Tested? Exploring the Practical Applications of Large Language Models in Software Testing

Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation

The Emergence of Large Language Models in Static Analysis: A First Look through Micro-Benchmarks

Software Testing with Large Language Models: Survey, Landscape, and Vision

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

Machine learning-based test smell detection

On the Effectiveness of LLMs for Manual Test Verifications

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?

A Preliminary Study on Using Large Language Models in Software Pentesting

A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks

An Empirical Study of Large Language Models for Type and Call Graph Analysis

An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study

Beyond Static Tools: Evaluating Large Language Models for Cryptographic Misuse Detection