Abstract:

Context: Deep learning (DL) techniques have gained significant popularity among software engineering (SE) researchers in recent years, because they can often solve SE challenges without extensive manual feature engineering or complex domain knowledge.

Objective: Although many DL studies report substantial gains in effectiveness over other state-of-the-art models, they often ignore two factors: (1) reproducibility, i.e., whether the reported experimental results can be obtained by other researchers using the authors' artifacts (source code and datasets) with the same experimental setup; and (2) replicability, i.e., whether the reported experimental results can be obtained by other researchers using their own re-implemented artifacts with a different experimental setup. We observed that DL studies commonly overlook these two factors, declaring them minor threats or leaving them for future work. This is mainly because DL models, unlike classical supervised machine learning (ML) methods (e.g., random forest), are highly complex, have many manually set parameters, and require a time-consuming optimization process. This study aims to investigate the urgency and importance of reproducibility and replicability for DL studies on SE tasks.

Method: We conducted a literature review of 147 DL studies recently published in 20 SE venues and 20 AI (Artificial Intelligence) venues to investigate these issues. We also re-ran four representative DL models in SE to identify the factors that may strongly affect the reproducibility and replicability of a study.

Results: Our statistics show the urgency of investigating these two factors in SE: only 10.2% of the studies investigate any research question to show that their models can address at least one issue of replicability and/or reproducibility, and more than 62.6% of the studies do not even share high-quality source code or complete data to support the reproducibility of their complex models. Meanwhile, our experimental results show the importance of reproducibility and replicability: the reported performance of a DL model could not be reproduced because of an unstable optimization process, and replicability could be substantially compromised if model training does not converge or if performance is sensitive to the vocabulary size or the testing data.

Conclusion: It is urgent for the SE community to provide a long-lasting link to a high-quality reproduction package, to improve the stability and convergence of DL-based solutions, and to avoid performance that is sensitive to differently sampled data.
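The stability concern raised in the Results can be illustrated with a minimal sketch: re-run the same experiment under several random seeds and compare the spread of scores with the reported value. Everything here is an assumption made for illustration and is not taken from the study: `train_and_evaluate` is a hypothetical stand-in that draws a noisy score instead of training a model, and the seed list, reported score, and tolerance are arbitrary.

```python
import random
import statistics


def train_and_evaluate(seed: int) -> float:
    """Hypothetical stand-in for training a DL model and returning a test score.

    A real study would train the actual model here; this toy version draws a
    noisy score so the script runs without any ML framework.
    """
    rng = random.Random(seed)  # all randomness is derived from the seed
    return 0.80 + rng.gauss(0.0, 0.02)


def stability_report(seeds, reported_score, tolerance=0.01):
    """Re-run the experiment once per seed and compare with the reported score."""
    scores = [train_and_evaluate(s) for s in seeds]
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation across re-runs
    return {
        "scores": scores,
        "mean": mean,
        "std": std,
        # Counted as reproduced only if the mean re-run score is within
        # the (assumed) tolerance of the reported value.
        "reproduced": abs(mean - reported_score) <= tolerance,
    }


report = stability_report(seeds=range(10), reported_score=0.80)
print(f"mean={report['mean']:.3f} std={report['std']:.3f} "
      f"reproduced={report['reproduced']}")
```

A large standard deviation across seeds would suggest an unstable optimization process, in which case a single reported number is hard for others to reproduce. Reporting the mean and spread over several runs is one possible way to make that visible.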

On the Reproducibility and Replicability of Deep Learning in Software Engineering
