SeDPGK: Semi-supervised software defect prediction with graph representation learning and knowledge distillation

Wangshu Liu, Ye Yue, Xiang Chen, Qing Gu, Pengzhan Zhao, Xuejun Liu, Jianjun Zhao
DOI: https://doi.org/10.1016/j.infsof.2024.107510
IF: 3.9
2024-06-23
Information and Software Technology
Abstract:
Context: Constructing an effective defect prediction model relies on a substantial number of labeled program modules. Unfortunately, labeling program modules is often time-consuming and error-prone. Semi-supervised software defect prediction (SSDP) can alleviate this issue by combining a small set of labeled modules with the remaining unlabeled modules from the same project.
Objective: Previous SSDP methods ignore the significant influence of dependencies between software modules. In addition, the potential of knowledge distillation to improve SSDP performance, by letting labeled instances guide the learning process while making effective use of information from unlabeled instances, has not been fully investigated.
Method: We propose a novel approach, SeDPGK. First, to exploit graph-structured knowledge, we construct the program dependence graph to extract control and data dependencies among modules. We then use graph neural networks (GNNs) to learn a graph representation of the module relationships, and combine it with statement semantics from the abstract syntax tree and traditional static features for diversity. Second, we jointly train multiple GNNs as teacher models, forming an ensemble of different styles of graph-based networks that generates trustworthy labels for unlabeled modules. Further, to preserve the teacher model's structural and semantic knowledge, we adopt trainable label propagation and a multi-layer perceptron as the student model, and reduce the differences between the teacher and student models using two widely used knowledge distillation functions.
Results: We conducted experiments on 17 real-world projects. The results show that SeDPGK outperforms semi-supervised baselines with an average improvement of 16.9% for PD, 42.5% for FAR, and 8.9% for AUC. Moreover, the performance improvement is consistently significant across multiple statistical tests.
Conclusion: The effectiveness of SeDPGK comes from aggregating heterogeneous GNNs. Moreover, the graph structure and semantic features hidden in the source code play a crucial role in the distillation framework.
computer science, information systems, software engineering
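The teacher-student step described in the Method part of the abstract can be illustrated with a small sketch. The abstract does not name the two distillation functions, so this example assumes one common choice: a temperature-softened KL divergence against the averaged teacher predictions, combined with cross-entropy on the labeled modules. All function names, the temperature, the weight `alpha`, and the toy logits are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of one logit vector, softened by the given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_labels(teacher_logits, temperature=2.0):
    """Average softened class probabilities over teachers.

    teacher_logits[t][i] is the logit vector of teacher t for module i.
    """
    n_teachers = len(teacher_logits)
    n_modules = len(teacher_logits[0])
    soft = []
    for i in range(n_modules):
        probs = [softmax(teacher_logits[t][i], temperature) for t in range(n_teachers)]
        soft.append([sum(col) / n_teachers for col in zip(*probs)])
    return soft

def distillation_loss(student_logits, soft_labels, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Assumed loss: alpha * CE(labeled modules) + (1 - alpha) * T^2 * KL(teacher || student).

    hard_labels[i] is a class index for labeled modules and None for unlabeled ones.
    """
    eps = 1e-12
    kl_total, ce_total, n_labeled = 0.0, 0.0, 0
    for logits, target, label in zip(student_logits, soft_labels, hard_labels):
        q = softmax(logits, temperature)
        # The KL term uses every module, labeled or not.
        kl_total += sum(p * (math.log(p + eps) - math.log(qk + eps))
                        for p, qk in zip(target, q))
        # The cross-entropy term uses only modules with a known label.
        if label is not None:
            ce_total -= math.log(softmax(logits)[label] + eps)
            n_labeled += 1
    soft_term = temperature ** 2 * kl_total / len(student_logits)
    hard_term = ce_total / n_labeled if n_labeled else 0.0
    return alpha * hard_term + (1 - alpha) * soft_term

# Toy data (hypothetical): 2 teachers, 3 modules, classes = [clean, defective].
teachers = [
    [[2.0, -1.0], [0.5, 0.3], [-1.5, 1.0]],
    [[1.5, -0.5], [0.1, 0.4], [-1.0, 2.0]],
]
student = [[1.0, 0.0], [0.2, 0.1], [-0.5, 0.8]]
labels = [0, None, 1]  # the second module is unlabeled

soft = ensemble_soft_labels(teachers)
loss = distillation_loss(student, soft, labels)
```

In this sketch the unlabeled module contributes only through the teacher's soft labels, which is one way a student could use information from unlabeled instances. In the paper the teachers are GNNs and the student combines label propagation with a multi-layer perceptron; here both are replaced by fixed logits to keep the example short.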