Abstract: Software defect prediction (SDP) is a crucial step before a software product is released. Cross-project defect prediction (CPDP) was introduced to anticipate defects in new projects that lack defect labels. CPDP uses the defect information of mature projects to speed up defect prediction for new projects, so that developers can quickly obtain defect information for the new project and test it in a targeted way. At present, the predominant CPDP approaches rely on deep learning, and the performance of the final model is notably affected by the quality of the training dataset. However, CPDP datasets contain few samples, and new projects have almost no label information, so general deep-learning-based CPDP models perform poorly. In addition, most current CPDP models do not fully consider that samples near the classification boundary become more numerous after cross-domain transfer, which leads to suboptimal predictive capability. To overcome these obstacles, we present contrastive learning pretraining for CPDP (ConCPDP), a CPDP method integrating contrastive pretraining and category boundary adjustment. We first perform data augmentation on the source and target domain code files and then parse the augmented code into abstract syntax trees (ASTs). Each AST is transformed into an integer sequence using specific mapping rules, and this sequence serves as input to the subsequent neural network. A neural network based on bidirectional long short-term memory (Bi-LSTM) receives the integer sequence and outputs a feature vector. The feature vectors are then fed into the contrastive module to optimise the feature extraction network. Finally, the pretrained feature extractor is fine-tuned using the maximum mean discrepancy (MMD) between the source-domain and target-domain feature distributions together with the binary classification loss on the source domain.
We conduct extensive experiments on the PROMISE dataset, which is commonly used for CPDP, to validate ConCPDP’s efficacy; it achieves superior results in terms of F1 measure, area under the curve (AUC), and Matthews correlation coefficient (MCC).
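
As an illustration of the fine-tuning objective described above, the following is a minimal sketch in plain Python of a loss that combines a squared-MMD estimate between source and target feature vectors with a binary cross-entropy term on the source labels. The abstract does not specify the kernel, its bandwidth, or how the two terms are weighted, so the Gaussian kernel, `sigma`, `mmd_weight`, and all function names here are assumptions made for illustration, not the authors' implementation.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel between two equal-length feature vectors (kernel choice is an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of the squared MMD between two samples of feature vectors."""
    return (mean_kernel(source, source, sigma)
            + mean_kernel(target, target, sigma)
            - 2.0 * mean_kernel(source, target, sigma))

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy for predicted defect probabilities."""
    losses = [
        -(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps)))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)

def fine_tuning_loss(src_feats, tgt_feats, src_probs, src_labels,
                     mmd_weight=1.0, sigma=1.0):
    """Source classification loss plus a weighted MMD term (weighting is hypothetical)."""
    return (binary_cross_entropy(src_probs, src_labels)
            + mmd_weight * mmd_squared(src_feats, tgt_feats, sigma))

# Toy feature vectors standing in for Bi-LSTM outputs (made-up values).
source_features = [[0.0, 0.1], [0.2, 0.0]]
target_features = [[1.0, 1.1], [0.9, 1.0]]
loss = fine_tuning_loss(source_features, target_features,
                        src_probs=[0.8, 0.3], src_labels=[1, 0])
```

In this sketch the MMD term is zero when both samples are identical and grows as the two feature distributions drift apart, so minimising it alongside the classification loss pushes the extractor towards features that are shared across projects.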

ConCPDP: A Cross-Project Defect Prediction Method Integrating Contrastive Pretraining and Category Boundary Adjustment
