Performance evaluation of software defect prediction with NASA dataset using machine learning techniques
Tamanna Siddiqui, Mohd Mustaqeem
DOI: https://doi.org/10.1007/s41870-023-01528-9
2023-10-04
International Journal of Information Technology
Abstract: The software industry’s growth and increasing complexity have made software maintenance more challenging. Software Defects (SD) are a significant contributor to quality degradation and waste effort, time, and money. If defects are not addressed in the early stages of development, they can surface at any later stage of the Software Development Life Cycle (SDLC). This study focuses on improving software quality through Software Defect Prediction (SDP) using machine learning (ML) and data balancing techniques. To mitigate the problem of imbalanced datasets, which often leads to model overfitting, the authors combine the Synthetic Minority Oversampling Technique (SMOTE) with ML approaches. The assessment covers several ML techniques, including Random Forest, SVM, KNN, and LDA, applied to the balanced CM1 dataset from the NASA PROMISE repository, with performance evaluated using accuracy, precision, recall, F1-score, and AUC-ROC. Random Forest emerges as the standout performer, with an accuracy of 98.09% and an F1-score of 97.25%. SVM and KNN also achieve high accuracy, at 97.71% and 97.56% respectively, while LDA shows balanced performance with an accuracy of 96.04% and an F1-score of 95.93%. The study reports significant performance improvements over the prior state of the art. The authors provide a roadmap for achieving improved performance and predictive capability in SDP, highlighting the novel contribution of the study. These findings hold potential for the software industry, offering ways to enhance software quality and streamline development.
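The balancing and scoring steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `smote` and `binary_metrics`, the parameters `k` and `seed`, and the toy feature vectors are all assumptions made for illustration, and the data below is not the CM1 dataset. The sketch uses only the Python standard library and shows the basic SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours) together with four of the listed metrics; AUC-ROC is left out because it needs classifier scores rather than hard labels.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Basic SMOTE sketch: create n_synthetic points by interpolating
    between minority samples and their k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest to the base first.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1, treating label 1 as 'defective'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical toy data: 6 clean modules, 2 defective ones.
    clean = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.9, 1.9], [1.1, 2.1], [1.4, 2.0]]
    defective = [[4.0, 5.0], [4.5, 5.5]]
    # Oversample the defective class until both classes have 6 samples.
    balanced_defective = defective + smote(defective, len(clean) - len(defective))
    print(len(clean), len(balanced_defective))
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

In practice a maintained implementation such as `imblearn.over_sampling.SMOTE` from the imbalanced-learn package would normally be preferred. To avoid leaking synthetic points into the evaluation, oversampling is usually applied to the training split only.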