Abstract: In recent years, deep learning has achieved remarkable results in various fields and has been deployed in safety-critical scenarios, where bugs in deep learning software can have disastrous consequences. To deepen the understanding of such bugs, researchers have conducted several empirical studies of their characteristics. These prior studies analyzed the source code, bug reports, pull requests, and fixes of deep learning bugs. Although they provide meaningful findings, to the best of our knowledge, no prior study has explored the runtime behavior of deep learning bugs, because collecting their runtime impacts is expensive. As a result, some fundamental questions about deep learning bugs remain open. For example, do most such bugs significantly affect prediction accuracy? The answers to these questions are useful to a wide audience. In this paper, we conduct the first empirical study of the runtime impacts of deep learning bugs. Our basic idea is to use a mutation tool to inject deliberately designed bugs into a typical deep learning application and its libraries, and then to compare the runtime behavior of the clean and buggy versions. In this way, we constructed 1,832 buggy versions and compared their execution results with those of the corresponding clean versions. Based on this comparison, we summarize 9 findings and present our answers to 3 research questions. For example, we find that more than half of the buggy versions do not lead to any observable errors, and most of them introduce only insignificant differences in the accuracy of their trained models. We interpret the significance of our findings from the perspectives of application programmers, API developers, and researchers. For example, our findings indicate that better results alone are insufficient to prove that parameters or treatments are better, and researchers should build strong theories to explain their improvements.