SHSE: A subspace hybrid sampling ensemble method for software defect number prediction

Haonan Tong, Wei Lu, Weiwei Xing, Bin Liu, Shihai Wang
DOI: https://doi.org/10.1016/j.infsof.2021.106747
IF: 3.9
2022-02-01
Information and Software Technology
Abstract: Context: Software defect number prediction (SDNP) helps allocate limited testing resources by ranking software modules according to their predicted defect numbers. However, the highly skewed distribution of defects greatly degrades the performance of SDNP models by preventing them from ranking software modules accurately. Objective: This paper introduces a novel subspace hybrid sampling ensemble (SHSE) method based on feature subspace construction, hybrid sampling, and ensemble learning for building high-performance SDNP models. Method: Specifically, we first construct a series of feature subspaces to ensure the diversity of base learners. In each feature subspace, we then use the proposed hybrid sampling method to balance the training subset, avoiding both the information loss caused by using only undersampling and the noisy data introduced by using only oversampling. Finally, we train each base learner and combine them using the proposed weighted ensemble strategy. Experiments are performed on 27 public defect datasets. We compare SHSE with five state-of-the-art resampling-based models and four zero-inflated/hurdle models in terms of the ranking performance measure fault-percentile-average (FPA). To demonstrate the effectiveness of SHSE, two statistical tests, the Wilcoxon signed-rank test and the Scott–Knott Effect Size Difference test, are used. Cliff’s δ is also computed to quantify the difference whenever SHSE and a baseline differ significantly. Results: The experimental results show that SHSE significantly outperforms the baselines and improves the performance over each baseline with at least a medium effect size on most datasets. On average, SHSE improves the performance over the resampling-based methods by 8.7% to 14.4% and over the zero-inflated/hurdle models by 10.3% to 15.2%. Conclusion: It can be concluded that SHSE is a more promising alternative for software defect number prediction.
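The abstract evaluates models with fault-percentile-average (FPA) but does not spell out how it is computed. The sketch below follows the definition commonly used in the defect-ranking literature: modules are ranked by predicted defect count, and FPA is the average, over every cutoff m, of the share of actual defects found in the top m modules. The function name `fpa` and the tie-breaking rule (ties keep their input order) are assumptions made for illustration, not details taken from the paper.

```python
from typing import Sequence


def fpa(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Fault-percentile-average of a ranking induced by `predicted`.

    actual[i]    -- true defect count of module i
    predicted[i] -- predicted defect count of module i
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    k = len(actual)
    total = sum(actual)
    if k == 0 or total == 0:
        raise ValueError("need at least one module and one actual defect")

    # Rank modules from highest to lowest predicted defect count.
    # Assumption: tied predictions keep their original input order.
    order = sorted(range(k), key=lambda i: predicted[i], reverse=True)

    running = 0.0  # actual defects captured by the top-m modules so far
    area = 0.0     # sum over m of the captured share of defects
    for i in order:
        running += actual[i]
        area += running / total
    return area / k


if __name__ == "__main__":
    true_defects = [3, 0, 1, 0]
    print(fpa(true_defects, [3, 0, 1, 0]))  # ranking that matches the truth
    print(fpa(true_defects, [0, 3, 0, 3]))  # ranking that inverts the truth
```

A higher FPA means the defect-dense modules sit nearer the top of the ranking, which is why the measure suits the resource-allocation setting the abstract describes.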
computer science, information systems, software engineering