Abstract:
Context: Constructing an effective defect prediction model relies on a substantial number of labeled program modules. Unfortunately, labeling program modules is often time-consuming and error-prone. Semi-supervised software defect prediction (SSDP) can alleviate this issue by combining a small set of labeled modules with the remaining unlabeled modules from the same project.
Objective: Previous SSDP methods ignore the significant influence of dependencies between software modules. In addition, the potential of knowledge distillation to let labeled instances guide the learning process while effectively using information from unlabeled instances to improve SSDP performance has not been fully investigated.
Method: We propose a novel approach, SeDPGK. First, to exploit graph-structured knowledge, we construct the program dependence graph to extract control and data dependencies among modules. We then use graph neural networks (GNNs) to learn a graph representation of the module relationships and, for diversity, combine it with the statement semantics of the abstract syntax tree and traditional static features. Second, we jointly train multiple GNNs as teacher models, ensembling various styles of graph-based networks to generate trustworthy labels for unlabeled modules. Finally, to preserve the teacher models' structural and semantic knowledge, we adopt a trainable label propagation and a multi-layer perceptron as the student model and reduce the differences between the teacher and student models using two widely used knowledge distillation functions.
Results: We conducted experiments on 17 real-world projects. The results show that SeDPGK outperforms semi-supervised baselines with an average improvement of 16.9% for PD, 42.5% for FAR, and 8.9% for AUC. Moreover, the performance improvement is consistently significant across multiple statistical tests.
Conclusion: The effectiveness of SeDPGK comes from aggregating heterogeneous GNNs. Moreover, the graph structure and semantic features hidden in the source code play a crucial role in the distillation framework.
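
The teacher-to-student step described in the Method can be illustrated with a minimal sketch. The abstract does not name the two distillation functions, so this example assumes one common choice: a KL divergence between temperature-softened class probabilities. The function names, the temperature value, and the logits are hypothetical and are not taken from SeDPGK.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one module's class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_labels(teacher_logits, temperature=1.0):
    """Average the softened probabilities of several teachers for one module."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    return [sum(p[c] for p in probs) / len(probs) for c in range(len(probs[0]))]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Assumed loss: KL between the teacher ensemble and the student outputs."""
    target = ensemble_soft_labels(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(target, student)

# Hypothetical logits for one unlabeled module; classes are [clean, defective].
teachers = [[0.2, 1.5], [0.0, 2.0], [0.5, 1.0]]  # three teacher GNNs
soft = ensemble_soft_labels(teachers, temperature=2.0)
pseudo_label = max(range(len(soft)), key=soft.__getitem__)

# A student that agrees with the teachers incurs a smaller loss.
loss_far = distillation_loss(teachers, [2.0, 0.0])
loss_near = distillation_loss(teachers, [0.2, 1.5])
```

In this sketch the averaged teacher distribution serves both as the soft target for the student and, through its arg-max, as a pseudo-label for the unlabeled module. Averaging probabilities rather than logits is one design choice among several for combining heterogeneous teachers.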

SeDPGK: Semi-supervised software defect prediction with graph representation learning and knowledge distillation
