Abstract: Software defect prediction aims to identify defect-prone code regions automatically before defects are discovered. Accurate prediction helps software practitioners prioritize their testing efforts. In recent decades, dozens of approaches have been proposed and have achieved good results in this field. However, in practical scenarios, many projects have few labeled instances, and most of those labeled instances are nondefective. The lack of training data and the class imbalance problem together pose serious challenges for software defect prediction. So far, few of the prevailing approaches handle these two difficulties well at the same time. One important reason is that they do not pay adequate attention to several key instances that are difficult to classify in a small, imbalanced dataset. This article introduces the concept of "instance hardness" to integrate the various difficulties of imbalanced classification tasks. Based on it, a novel imbalance learning framework named self-paced ensemble of ensembles (SPE$^{2}$) is proposed to perform software defect prediction.
SPE$^{2}$ aims to generate a strong ensemble of ensembles by harmonizing instance hardness in a self-paced manner via undersampling. Finally, SPE$^{2}$ is extensively compared with eight imbalance learning approaches on ten open-source defect datasets.
Experiments indicate that SPE$^{2}$ improves performance and achieves significantly better F-measure values than its existing counterparts, as assessed by Brunner's statistical significance test and Cliff's effect size.
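
The abstract reports effect sizes using Cliff's measure. As a minimal sketch of how that statistic is commonly computed (this is not the authors' evaluation code), Cliff's delta compares every value in one group against every value in the other and returns the difference between the share of pairs where the first is larger and the share where it is smaller. The function name `cliffs_delta` and the F-measure values below are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    if len(xs) == 0 or len(ys) == 0:
        raise ValueError("both samples must be non-empty")
    # Count pairs where the first group wins and where it loses; ties count for neither.
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical F-measure values, invented purely for illustration.
proposed = [0.62, 0.58, 0.66, 0.61]
baseline = [0.55, 0.57, 0.60, 0.52]
print(cliffs_delta(proposed, baseline))
```

The result lies in [-1, 1]: values near 1 mean the first group almost always scores higher, values near 0 mean the two groups overlap heavily, and values near -1 mean the first group almost always scores lower.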

SHSE: A subspace hybrid sampling ensemble method for software defect number prediction

HYDRA: Massively Compositional Model for Cross-Project Defect Prediction

A Hybrid Sampling and Multi-Objective Optimization Approach for Enhanced Software Defect Prediction

Heterogeneous Defect Prediction with Two-Stage Ensemble Learning

A Software Defect Prediction Approach Based on Hybrid Feature Dimensionality Reduction

An Improved Semi-Supervised Learning Method for Software Defect Prediction

Comparative Study of Ensemble Learning Methods in Just-in-time Software Defect Prediction

SPE$^{2}$: Self-Paced Ensemble of Ensembles for Software Defect Prediction

An empirical study of data sampling techniques for just-in-time software defect prediction

Software defect prediction ensemble learning algorithm based on adaptive variable sparrow search algorithm

Software Defect Prediction Approach Based on a Diversity Ensemble Combined With Neural Network

FSDNP: Feature Selection Method for Software Defect Number Prediction

An Empirical Study on the Effectiveness of Data Resampling Approaches for Cross-Project Software Defect Prediction

A hybrid‐ensemble model for software defect prediction for balanced and imbalanced datasets using AI‐based techniques with feature preservation: SMERKP‐XGB

Software Defect Prediction Using Deep Q‐Learning Network‐Based Feature Extraction

A Novel Class-Imbalance Learning Approach for Both Within-Project and Cross-Project Defect Prediction

Unsupervised Deep Domain Adaptation for Heterogeneous Defect Prediction

Software defect prediction based on nested-stacking and heterogeneous feature selection

Hybrid deep architecture for software defect prediction with improved feature set

Software defect prediction ensemble learning algorithm based on 2-step sparrow optimizing extreme learning machine

Predicting the precise number of software defects: Are we there yet?