Abstract: The performance of software defect prediction (SDP) models determines the priority of test resource allocation. Researchers also use interpretability techniques to gain empirical knowledge about software quality from SDP models. However, SDP methods proposed in prior research rarely consider how data transformation methods, which are simple but commonly used preprocessing techniques, affect the performance and interpretability of SDP models. Therefore, in this paper, we investigate the impact of three data transformation methods (Log, Minmax, and Z-score) on the performance and interpretability of SDP models. Our empirical study covers (i) six classification techniques (random forest, decision tree, logistic regression, Naive Bayes (NB), K-nearest neighbors, and multilayer perceptron), (ii) six performance evaluation indicators (Accuracy, Precision, Recall, F1, MCC, and AUC), (iii) two interpretability methods (permutation and SHAP), (iv) two feature importance measures (Top-k feature rank overlap and difference), and (v) three datasets (Promise, Relink, and AEEEM). Our results show that data transformation methods can significantly improve the performance of SDP models and can substantially change which features are ranked as most important. Specifically, the impact of data transformation methods on the performance and interpretability of SDP models depends on the classification technique and the evaluation indicator. We observe that log transformation lowers the Precision of the NB model by 5% but improves its performance by 7%–61% on the other five indicators. Minmax and Z-score transformations improve NB model performance by 2%–9% across all indicators. However, all three transformation methods lead to substantial changes in the Top-5 important feature ranks, with rank differences exceeding 2 in 40%–80% of cases (detailed results are available in the main content).
Based on our findings, we recommend (1) considering the impact of data transformation methods on model performance and interpretability when designing SDP approaches, because transformations can improve model accuracy while potentially obscuring important features, which makes interpretation harder, (2) conducting comparative experiments with and without the transformations to validate the effectiveness of proposed methods that are designed to improve prediction performance, and (3) tracking changes in the most important features before and after applying data transformation methods, so that interpretability conclusions remain precise and traceable. Our study reminds researchers and practitioners that comprehensive consideration is needed even when using other, similarly simple data processing methods.
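To make the three transformations and the two rank-based measures concrete, the following is a minimal Python sketch. It is not the authors' implementation, and the abstract does not give exact definitions, so several choices here are assumptions: the log transformation is taken as ln(1 + x), Z-score uses the population standard deviation, Top-k overlap is computed as the Jaccard overlap of the two Top-k feature sets, and rank difference is the absolute shift in a feature's position. All function names and the example feature names are hypothetical.

```python
import math
from statistics import mean, pstdev


def log_transform(values):
    """Assumed form ln(1 + x), so zero-valued metrics stay defined."""
    return [math.log1p(v) for v in values]


def minmax_transform(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def zscore_transform(values):
    """Center on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def top_k_overlap(rank_before, rank_after, k):
    """Jaccard overlap of the Top-k feature sets (assumed definition)."""
    a, b = set(rank_before[:k]), set(rank_after[:k])
    return len(a & b) / len(a | b)


def rank_differences(rank_before, rank_after, k):
    """Absolute position shift of each Top-k feature of the first ranking.

    Assumes both rankings contain the same set of features.
    """
    pos_after = {feature: i for i, feature in enumerate(rank_after)}
    return {
        feature: abs(i - pos_after[feature])
        for i, feature in enumerate(rank_before[:k])
    }


if __name__ == "__main__":
    # Hypothetical metric values for one feature across five modules.
    loc = [0, 10, 35, 200, 1200]
    print(log_transform(loc))
    print(minmax_transform(loc))
    print(zscore_transform(loc))

    # Hypothetical feature rankings before and after a transformation.
    before = ["loc", "wmc", "cbo", "rfc", "lcom"]
    after = ["wmc", "rfc", "loc", "lcom", "cbo"]
    print(top_k_overlap(before, after, k=3))
    print(rank_differences(before, after, k=3))
```

Comparing rankings produced from untransformed and transformed data in this way is one simple route to the third recommendation above: a low overlap or large position shifts signal that an interpretability conclusion may depend on the preprocessing step rather than on the data itself.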
