A Data Filtering Method Based on Agglomerative Clustering

Xiao Yu, Peipei Zhou, Jiansheng Zhang, Jin Liu
DOI: https://doi.org/10.18293/SEKE2017-043
2017-07-05
Abstract: Cross-company defect prediction (CCDP) is a practical approach that trains a prediction model on one or more projects of a source company and then applies the model to a target company. Unfortunately, a large amount of irrelevant cross-company (CC) data usually makes it difficult to build a cross-company defect prediction model with high performance. To address this issue, this paper proposes a data filtering method based on Agglomerative Clustering (DFAC) for cross-company defect prediction. First, DFAC combines within-company (WC) instances and cross-company instances and uses an agglomerative clustering algorithm to group these instances. Second, DFAC selects the sub-clusters that contain at least one WC instance and collects the CC instances in the selected sub-clusters into a new CC dataset. Experimental results on 15 public PROMISE datasets show that, compared with existing data filtering methods, DFAC increases the PD value, reduces the PF value, and achieves higher G-measure and AUC values.
Keywords: software defect prediction; cross-company defect prediction; data filter; agglomerative clustering
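The two steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the linkage criterion, the distance metric, or how the cluster hierarchy is cut, so the sketch assumes average linkage, Euclidean distance, and a fixed number of clusters. The names `agglomerate`, `dfac_filter`, and `n_clusters` are hypothetical, and the toy data is made up for illustration.

```python
import math


def agglomerate(points, n_clusters):
    """Naive average-linkage agglomerative clustering (assumed linkage).

    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        # Mean Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, i, j = min(
            (avg_dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def dfac_filter(wc, cc, n_clusters):
    """Keep the CC instances that share a cluster with at least one WC instance."""
    points = list(wc) + list(cc)  # step 1: combine WC and CC instances
    n_wc = len(wc)
    kept = []
    for cluster in agglomerate(points, n_clusters):
        # Step 2: select clusters containing a WC instance, collect their CC members.
        if any(idx < n_wc for idx in cluster):
            kept.extend(idx - n_wc for idx in cluster if idx >= n_wc)
    return [cc[k] for k in sorted(kept)]


# Toy feature vectors (hypothetical): two CC instances lie near the WC data.
wc = [(0.0, 0.0), (0.1, 0.2)]
cc = [(0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.05, 0.1)]
filtered = dfac_filter(wc, cc, n_clusters=2)
print(filtered)
```

In this toy run the two CC instances far from the WC data fall into a cluster with no WC member and are discarded. The naive merge loop is cubic in the number of instances, so a library implementation of hierarchical clustering would be the practical choice on real defect datasets.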