Abstract: Effort-Aware Defect Prediction (EADP) methods sort software modules by defect density and guide the testing team to inspect the modules with high defect density first. Previous studies indicated that some feature selection methods can improve the performance of Classification-Based Defect Prediction (CBDP) models, and that the Correlation-based feature subset selection method with the Best First strategy (CorBF) performed the best. However, the practical benefits of feature selection methods for EADP performance are still unknown, and blindly applying CorBF, the best-performing method in CBDP, to pre-process defect datasets may not improve the performance of EADP models and may even degrade it. To assess the impact of feature selection techniques on EADP, we examined a total of 24 feature selection methods with 10 classifiers embedded in a state-of-the-art EADP model (CBS+) on the 41 PROMISE defect datasets. We employed six evaluation metrics to assess the performance of EADP models comprehensively. The results show that (1) the impact of the feature selection methods varies across classifiers and datasets; (2) the four wrapper-based feature subset selection methods with forwards search, that is, AdaBoost with Forwards Search, Deep Forest with Forwards Search, Random Forest with Forwards Search, and XGBoost with Forwards Search (XGBF), are better than the other methods across the studied classifiers and datasets, and XGBF with XGBoost as the embedded classifier in CBS+ performs the best on the datasets; (3) CorBF, the best-performing method in CBDP, does not perform well on the EADP task; (4) the selected features vary with the feature selection method and the dataset, and the features noc (number of children), ic (inheritance coupling), cbo (coupling between object classes), and cbm (coupling between methods) are frequently selected by the four wrapper-based feature subset selection methods with forwards search; (5) using AdaBoost, deep forest, random forest, and XGBoost as the base classifiers embedded in CBS+ can achieve the best performance. In summary, we recommend that software testing teams employ XGBF with XGBoost as the embedded classifier in CBS+ to enhance EADP performance.
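The two mechanisms named in the abstract, ranking modules by defect density and wrapper-based feature subset selection with forwards search, can be illustrated with a minimal sketch. This is not the CBS+ model or the study's implementation: the function names, the stopping rule (stop when no candidate feature improves the score), and the toy scoring function are assumptions made for illustration. In a real wrapper, `score` would train the embedded classifier on the candidate subset and return an effort-aware metric.

```python
from typing import Callable, List, Sequence, Tuple


def rank_by_defect_density(modules: Sequence[Tuple[str, float, int]]) -> List[str]:
    """Order modules for inspection, highest predicted defect density first.

    Each module is (name, predicted_defects, lines_of_code); lines_of_code
    is assumed to be positive.
    """
    ordered = sorted(modules, key=lambda m: m[1] / m[2], reverse=True)
    return [name for name, _, _ in ordered]


def forward_search(
    features: Sequence[str],
    score: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Greedy forward selection: repeatedly add the single feature that
    raises score(subset) the most, and stop when nothing improves it."""
    selected: List[str] = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Evaluate every one-feature extension of the current subset.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            break  # assumed stopping rule: no strict improvement
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Hypothetical stand-in for "train the classifier, return a metric".
TOY_UTILITY = {"noc": 3, "cbo": 2, "loc": -1}


def toy_score(subset: List[str]) -> float:
    return sum(TOY_UTILITY[f] for f in subset)


if __name__ == "__main__":
    print(forward_search(["loc", "noc", "cbo"], toy_score))
    print(rank_by_defect_density([("A", 2, 100), ("B", 1, 20), ("C", 0, 50)]))
```

With the toy scores, the search adds `noc`, then `cbo`, and stops because adding `loc` would lower the score. The ranking places module `B` first because one predicted defect in 20 lines is a higher density than two in 100.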

Impact Evaluation of Significant Feature Set in Cross Project for Defect Prediction through Hybrid Feature Selection in Multiclass

HYDRA: Massively Compositional Model for Cross-Project Defect Prediction

Combined Classifier for Cross-Project Defect Prediction: an Extended Empirical Study

An effective feature selection based cross-project defect prediction model for software quality improvement

FeSCH: A Feature Selection Method Using Clusters of Hybrid-data for Cross-Project Defect Prediction

Cross‐project defect prediction method based on genetic algorithm feature selection

A Cluster Based Feature Selection Method for Cross-Project Software Defect Prediction

Cross-Project Defect Prediction Based on Two-Phase Feature Importance Amplification

Cross-Project Defect Prediction Considering Multiple Data Distribution Simultaneously

Towards Cross-Project Defect Prediction with Imbalanced Feature Sets

Cross-Project Software Defect Prediction Based on SMOTE and Deep Canonical Correlation Analysis

Class Imbalance Reduction and Centroid based Relevant Project Selection for Cross Project Defect Prediction

SDP-MTF: A Composite Transfer Learning and Feature Fusion for Cross-Project Software Defect Prediction

An investigation on the effect of cross project data for prediction accuracy

The Impact of Feature Selection Techniques on Effort-Aware Defect Prediction: an Empirical Study

DSSDPP: Data Selection and Sampling Based Domain Programming Predictor for Cross-Project Defect Prediction

An Improved Method for Training Data Selection for Cross-Project Defect Prediction

MHCPDP: multi-source heterogeneous cross-project defect prediction via multi-source transfer learning and autoencoder

A study on cross-project fault prediction through resampling and feature reduction along with source projects selection

A Software Defect Prediction Approach Based on Hybrid Feature Dimensionality Reduction

Training data selection for imbalanced cross-project defect prediction