Abstract: Defect prediction is an important task for preserving software quality. Most prior work on defect prediction uses software features, such as the number of lines of code, to predict whether a file or commit will be defective in the future. There are several reasons to keep the number of features used in a defect prediction model small. For example, using a small number of features avoids the problem of multicollinearity and the so-called ‘curse of dimensionality’. Feature selection and feature reduction techniques can both help to reduce the number of features in a model. Feature selection techniques do so by selecting the most important of the original features, while feature reduction techniques do so by creating new, combined features from the original features. Several recent studies have investigated the impact of feature selection techniques on defect prediction. However, no large-scale study has investigated the impact of multiple feature reduction techniques on defect prediction. In this paper, we study the impact of eight feature reduction techniques on the performance, and the variance in performance, of five supervised and five unsupervised defect prediction models. In addition, we compare the impact of the studied feature reduction techniques with that of the two best-performing feature selection techniques (according to prior work). The following findings are the highlights of our study: (1) The studied correlation-based and consistency-based feature selection techniques result in the best-performing supervised defect prediction models, while the neural network-based feature reduction techniques (restricted Boltzmann machine and autoencoder) result in the best-performing unsupervised defect prediction models. In both cases, the defect prediction models that use the selected or generated features perform better than those that use the original features (in terms of AUC and performance variance). (2) Neural network-based feature reduction techniques generate features whose performance has a small variance across both supervised and unsupervised defect prediction models. Hence, we recommend that practitioners who do not wish to choose a best-performing defect prediction model for their data use a neural network-based feature reduction technique.
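
The distinction drawn above between feature selection (keeping a subset of the original features) and feature reduction (building new, combined features) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy metrics (lines of code, complexity, churn) and labels are invented, the selection step is a plain correlation filter rather than the correlation-based or consistency-based techniques studied in the paper, and the reduction step uses the first principal component as a familiar stand-in for a reduction technique rather than a restricted Boltzmann machine or autoencoder.

```python
import math

# Hypothetical toy data: one row per file, columns = [loc, complexity, churn].
X = [
    [120, 4, 3],
    [950, 21, 40],
    [300, 9, 2],
    [1500, 35, 55],
    [80, 2, 1],
    [700, 18, 30],
]
y = [0, 1, 0, 1, 0, 1]  # hypothetical labels: 1 = defective, 0 = clean


def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return 0.0 if va == 0 or vb == 0 else cov / math.sqrt(va * vb)


def select_features(X, y, k):
    """Feature selection: keep the k ORIGINAL columns most correlated with y."""
    d = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(d)]
    keep = sorted(sorted(range(d), key=lambda j: -scores[j])[:k])
    return keep, [[row[j] for j in keep] for row in X]


def standardize(X):
    """Center each column and scale it to unit variance."""
    n, d = len(X), len(X[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in X]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (X[i][j] - mean) / std
    return out


def reduce_to_one_feature(X, iters=200):
    """Feature reduction: project onto the first principal component,
    found by power iteration on the covariance of the standardized data."""
    Z = standardize(X)
    n, d = len(Z), len(Z[0])
    cov = [[sum(Z[i][a] * Z[i][b] for i in range(n)) / n for b in range(d)]
           for a in range(d)]
    w = [1.0] * d  # deterministic starting vector
    for _ in range(iters):
        w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    # Each new feature value is a weighted combination of ALL original features.
    return w, [[sum(z * wj for z, wj in zip(row, w))] for row in Z]


keep, selected = select_features(X, y, k=2)
w, projected = reduce_to_one_feature(X)

print("selected column indices:", keep)
print("weights of the combined feature:", [round(v, 3) for v in w])
```

In this sketch the selected features are still directly readable as original metrics, whereas the generated feature mixes all three metrics through the weights in `w`.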

Exploring better alternatives to size metrics for explainable software defect prediction

Explainable Software Defect Prediction from Cross Company Project Metrics Using Machine Learning

Understanding machine learning software defect predictions

Combined Classifier for Cross-Project Defect Prediction: an Extended Empirical Study

Interpretability application of the Just-in-Time software defect prediction model

A software defect prediction method with metric compensation based on feature selection and transfer learning

Deep Learning for Just-In-Time Defect Prediction

Predicting the precise number of software defects: Are we there yet?

Predicting Defective Visual Code Changes in a Multi-Language AAA Video Game Project

How Far We Have Progressed in the Journey? An Examination of Cross-Project Defect Prediction

An Empirical Study of Model-Agnostic Techniques for Defect Prediction Models

Discriminating features-based cost-sensitive approach for software defect prediction

Revisiting the Impact of Dependency Network Metrics on Software Defect Prediction

Defect prediction with bad smells in code

The impact of feature reduction techniques on defect prediction models

Cross-Project Defect Prediction Considering Multiple Data Distribution Simultaneously

Defect Prediction With Semantics and Context Features of Codes Based on Graph Representation Learning

Multi-project Regression Based Approach for Software Defect Number Prediction

Explainable Software Defect Prediction: Are We There Yet?

Does class size matter? An in-depth assessment of the effect of class size in software defect prediction

Software Defect Prediction Based on Elman Neural Network and Cuckoo Search Algorithm