An Empirical Study of Software Change Classification with Imbalance Data‐handling Methods

Xiaoyan Zhu, Binbin Niu, E. James Whitehead, Zhongbin Sun
DOI: https://doi.org/10.1002/spe.2606
2018-01-01
Abstract: Bug prediction in software code changes can help developers find and fix bugs as soon as they are introduced, and thus improve the effectiveness and validity of bug fixing. In data mining, this problem can be regarded as a change classification task. However, one of its key characteristics, class imbalance, holds back the performance of standard classification methods. In this paper, we consider a number of imbalance data‐handling methods and extract a more comprehensive set of change feature groups, aiming to achieve better change classification performance. Two different types of imbalance data‐handling methods, namely resampling and ensemble learning, are employed, and we pay particular attention to the performance of their combination. To compare the different imbalance data‐handling methods, an experiment with 10 open source projects is conducted. Four classification methods, J48, Naïve Bayes, SMO, and Random Forest, are used as standard classifiers and as base classifiers, respectively. Moreover, the contribution of different groups of change features is evaluated. Experimental results show that imbalance data‐handling methods can improve the performance of change classification, and that the combination methods, which take advantage of both ensemble learning and resampling, perform better than ensemble learning or resampling methods used individually. Of the studied imbalance data‐handling methods, the combination of Bagging and random undersampling with J48 as the base classifier yields better prediction results than the other methods. Additionally, of the collected change features, text vector features account for a larger proportion than the others.
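The combination of Bagging and random undersampling named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the two steps are interleaved, so the sketch assumes one common formulation in which every bag draws, with replacement, as many rows from each class as the minority class contains. A one-level decision stump stands in for J48, and all function names (`fit_stump`, `fit_under_bagging`, `predict`) are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Fit a one-level decision tree (a simple stand-in for J48).

    Returns (feature index, threshold, label if <= threshold, label if > threshold).
    """
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(lab != l_lab for lab in left)
            errors += sum(lab != r_lab for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:
        # No usable split (all rows identical): always predict the majority label.
        majority = Counter(y).most_common(1)[0][0]
        return (0, 0.0, majority, majority)
    return best[1:]


def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab


def fit_under_bagging(X, y, n_estimators=11, seed=0):
    """Train one stump per class-balanced bag (assumed formulation, see lead-in)."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(rows) for rows in by_class.values())
    models = []
    for _ in range(n_estimators):
        Xb, yb = [], []
        for label, rows in by_class.items():
            # Sampling n_min rows with replacement undersamples the majority class
            # and bootstraps the minority class in a single step.
            for _ in range(n_min):
                Xb.append(rng.choice(rows))
                yb.append(label)
        models.append(fit_stump(Xb, yb))
    return models


def predict(models, row):
    """Majority vote over the ensemble."""
    votes = Counter(predict_stump(m, row) for m in models)
    return votes.most_common(1)[0][0]
```

Here `X` would be a list of numeric feature rows, one per code change, and `y` the matching labels (for example 0 for clean and 1 for buggy). Because every bag is class-balanced, each base classifier sees the buggy class as often as the clean one, and the vote across bags offsets the information lost by discarding majority-class rows in any single bag.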