Cross Version Defect Prediction with Representative Data Via Sparse Subset Selection

Zhou Xu,Shuai Li,Yutian Tang,Xiapu Luo,Tao Zhang,Jin Liu,Jun Xu
DOI: https://doi.org/10.1145/3196321.3196331
2018-01-01
Abstract:Software defect prediction aims at detecting the defect-prone software modules by mining historical development data from software repositories. If such modules are identified at the early stage of the development, it can save large amounts of resources. Cross Version Defect Prediction (CVDP) is a practical scenario by training the classification model on the historical data of the prior version and then predicting the defect labels of modules of the current version. However, software development is a constantly-evolving process which leads to the data distribution differences across versions within the same project. The distribution differences will degrade the performance of the classification model. In this paper, we approach this issue by leveraging a state-of-the-art Dissimilarity-based Sparse Subset Selection (DS3) method. This method selects a representative module subset from the prior version based on the pairwise dissimilarities between the modules of two versions and assigns each module of the current version to one of the representative modules. These selected modules can well represent the modules of the current version, thus mitigating the distribution differences. We evaluate the effectiveness of DS3 for CVDP performance on total 40 cross-version pairs from 56 versions of 15 projects with three traditional and two effort-aware indicators. The extensive experiments show that DS3 outperforms three baseline methods, especially in terms of two effort-aware indicators.
What problem does this paper attempt to address?