Abstract: Automated tests play an important role in software evolution because they can rapidly detect faults introduced during changes. In practice, code-coverage metrics are often used as criteria to evaluate the effectiveness of test suites with focus on regression faults. However, code coverage only expresses which portion of a system has been executed by tests, but not how effective the tests actually are in detecting regression faults. Our goal was to evaluate the validity of code coverage as a measure for test effectiveness. To do so, we conducted an empirical study in which we applied an extreme mutation testing approach to analyze the tests of open-source projects written in Java. We assessed the ratio of pseudo-tested methods (those tested in a way such that faults would not be detected) to all covered methods and judged their impact on the software project. The results show that the ratio of pseudo-tested methods is acceptable for unit tests but not for system tests (that execute large portions of the whole system). Therefore, we conclude that the coverage metric is only a valid effectiveness indicator for unit tests.

What problem does this paper attempt to address?

This paper evaluates whether code coverage is a valid measure of test effectiveness. In an empirical study, the authors apply extreme mutation testing to the test suites of open-source projects written in Java. They determine the proportion of pseudo-tested methods (methods that are covered by tests but whose faults the tests would not detect) and assess the impact of these methods on the projects. The main objective is to determine whether code coverage reflects how well a test suite detects regression faults, and in particular how unit tests and system tests differ in this respect.

### Research Background

- **Automated testing** plays an important role in software evolution because it can quickly detect faults introduced during changes.
- **Code coverage** is commonly used as a criterion for evaluating the effectiveness of test suites, especially with respect to regression faults. However, code coverage only indicates which portion of the system has been executed by tests; it does not directly reflect how effective the tests actually are.

### Research Questions

1. **What is the proportion of pseudo-tested methods?**
   - Evaluate how many methods are covered by tests but not effectively tested.
2. **Does the proportion of pseudo-tested methods depend on the test type (unit test vs. system test)?**
   - Explore how the test type affects the proportion of pseudo-tested methods.
3. **How severe are the pseudo-tested methods?**
   - Analyze the functional purpose and severity of pseudo-tested methods to understand how their lack of effective testing affects the project.

### Research Methods

- **Experimental Design**: Select 14 open-source projects written in Java and analyze them with mutation testing.
- **Mutation Testing**: Generate mutants by deleting all the logic of a method and check whether the test cases detect these mutants.
- **Data Collection**: Record the test coverage of each method and the proportion of pseudo-tested methods.

### Main Findings

1. **Proportion of Pseudo-Tested Methods**:
   - For most study objects, the proportion of pseudo-tested methods varies between 6% and 53%.
   - For example, the proportion of pseudo-tested methods in the Apache Commons Lang project is only 1.9%, while the proportion in the Predictor project is as high as 52.7%.
2. **Relationship between the Proportion of Pseudo-Tested Methods and Test Types**:
   - The average proportion of pseudo-tested methods for unit tests is 11.41%, with a standard deviation of 6.42%.
   - The average proportion of pseudo-tested methods for system tests is 35.48%, with a standard deviation of 20.60%.
   - The proportion for unit tests is relatively stable, while the proportion for system tests fluctuates more.
3. **Severity of Pseudo-Tested Methods**:
   - In 11 study objects, more than half of the pseudo-tested methods are of medium or high severity.
   - For example, the Apache Commons Math project contains a large number of insignificant pseudo-tested methods, while other projects contain many important ones.

### Conclusions

- **Code coverage is a reasonable indicator of unit test effectiveness**, but it is not a valid indicator for system tests, because the proportion of pseudo-tested methods in system tests is high and fluctuates more.
- **Pseudo-tested methods have a significant impact on the quality of software projects**, especially when the affected methods are functionally important and of high severity.

### Future Work

- Further study different types of tests and testing strategies to improve test effectiveness.
- Explore more mutation testing techniques to reduce computational costs and false positive rates.
- Apply the research results to closed-source systems to verify their general applicability.
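
To make the notion of a pseudo-tested method from the Research Methods section concrete, the following is a minimal Java sketch, not the tooling used in the study. It models a covered method as a function value, replaces its body with an emptied variant, and reruns the covering tests: if all of them still pass, the method counts as pseudo-tested. All names (`ExtremeMutationSketch`, `CoveredMethod`, `EMPTIED_BODY`, `sample`) are hypothetical, and returning `0` from the emptied body is an assumption about how a removed body is represented for an `int`-returning method.

```java
import java.util.List;
import java.util.function.IntBinaryOperator;
import java.util.function.Predicate;

// Illustrative sketch only: every name below is hypothetical, not from the paper.
class ExtremeMutationSketch {

    // A covered method together with the tests that execute it.
    static final class CoveredMethod {
        final String name;
        final IntBinaryOperator original;
        final List<Predicate<IntBinaryOperator>> tests;

        CoveredMethod(String name, IntBinaryOperator original,
                      List<Predicate<IntBinaryOperator>> tests) {
            this.name = name;
            this.original = original;
            this.tests = tests;
        }
    }

    // Assumed extreme mutant for an int-returning method: body replaced by "return 0".
    static final IntBinaryOperator EMPTIED_BODY = (a, b) -> 0;

    // Pseudo-tested: all tests pass on the original and still pass on the mutant.
    static boolean isPseudoTested(CoveredMethod m) {
        boolean passOnOriginal = m.tests.stream().allMatch(t -> t.test(m.original));
        boolean passOnMutant = m.tests.stream().allMatch(t -> t.test(EMPTIED_BODY));
        return passOnOriginal && passOnMutant;
    }

    // Ratio of pseudo-tested methods to all covered methods.
    static double pseudoTestedRatio(List<CoveredMethod> covered) {
        if (covered.isEmpty()) return 0.0;
        long pseudo = covered.stream().filter(ExtremeMutationSketch::isPseudoTested).count();
        return (double) pseudo / covered.size();
    }

    static List<CoveredMethod> sample() {
        // This test checks the result, so the emptied body makes it fail.
        Predicate<IntBinaryOperator> checksSum = impl -> impl.applyAsInt(2, 3) == 5;
        // This test only executes the method and never inspects the result.
        Predicate<IntBinaryOperator> onlyExecutes = impl -> { impl.applyAsInt(2, 3); return true; };
        return List.of(
            new CoveredMethod("add", (a, b) -> a + b, List.of(checksSum)),
            new CoveredMethod("max", Math::max, List.of(onlyExecutes)));
    }

    public static void main(String[] args) {
        for (CoveredMethod m : sample()) {
            System.out.println(m.name + " pseudo-tested: " + isPseudoTested(m));
        }
        System.out.println("ratio: " + pseudoTestedRatio(sample()));
    }
}
```

In this sample both methods are fully covered, yet only `add` is effectively tested, which is the gap between coverage and effectiveness that the study measures. Real extreme mutation tools operate on compiled code rather than on function values; the sketch only mirrors the classification logic.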

Will My Tests Tell Me If I Break This Code?
