Impact Evaluation of Significant Feature Set in Cross Project for Defect Prediction through Hybrid Feature Selection in Multiclass

Sana Gul, Rizwan Bin Faiz, Mohammad Aljaidi, Ghassan Samara, Ayoub Alsarhan, Ahmad al-Qerem
DOI: https://doi.org/10.1101/2023.07.20.549868
2023-07-22
bioRxiv
Abstract: Cross-project defect prediction (CPDP) is an important approach to identifying defects in software projects. In CPDP, knowledge is extracted from a source project and applied to predict labels for a target project. However, model performance can be degraded by insignificant and irrelevant features. Hybrid feature selection (HFS) can help achieve high prediction accuracy by retaining only significant, relevant features. Our aim is to explore the effect of significant feature selection through a hybrid approach on cross-project (CP) defect prediction for multi-class datasets. We leverage the strengths of Random Forest (RF) and Recursive Feature Elimination with Cross-Validation (RFECV), which together select a small set of significant features. Our controlled experiment follows a 1 Factor 2 Treatments (1F2T) design. Exploratory Data Analysis (EDA) shows that all versions of the PROMISE repository are multi-class and contain duplicated rows, gaps in the distribution of values, and imbalanced classes. Hence, after removing duplicated rows, reducing the distribution gap in the data, and balancing the classes, we selected a significant feature set through the hybrid approach, i.e., RF and RFECV. We used a Convolutional Neural Network (CNN) with SoftMax as the last layer as the classifier to predict cross-project defects. Our experimental setup yielded an average prediction accuracy of 78%, measured in terms of AUC, across all 14 versions. Our experimental results show that HFS has a significant impact on defect prediction accuracy for different datasets in the CP setting.
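The hybrid selection step described in the abstract (RF combined with RFECV) might look roughly like the following scikit-learn sketch. It is an illustration only, not the authors' implementation: the synthetic three-class data stands in for a PROMISE project, and the forest size, fold count, step size, and accuracy scoring are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a multi-class defect dataset (assumption:
# this is NOT PROMISE data; 20 metrics, 3 defect classes).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Random Forest supplies the feature importances that RFECV uses
# to decide which feature to drop at each elimination round.
forest = RandomForestClassifier(n_estimators=50, random_state=0)

# RFECV removes one feature per round and keeps the feature count
# with the best cross-validated score (scoring choice is an assumption).
selector = RFECV(
    estimator=forest,
    step=1,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)  # indices of kept features
X_reduced = selector.transform(X)             # data restricted to them
print("features kept:", selector.n_features_, "indices:", selected.tolist())
```

In a cross-project setting, one plausible arrangement is to fit the selector on the source project and call `selector.transform` on the target project, so that both share the same reduced feature set before being passed to the classifier.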