Abstract: Deep learning (DL) libraries have become a key component in developing and deploying DL-based software. As DL models are increasingly applied across domains in both academia and industry, bugs in DL libraries can cause unexpected and severe outcomes. There is therefore an urgent need to improve the software quality of DL libraries. Although some existing approaches are specifically designed for testing DL libraries, their focus is usually limited to one specific domain, such as computer vision (CV). It remains unclear how well, and to what extent, existing approaches detect bugs in different DL libraries across different task domains. To bridge this gap, we first conduct an empirical study of four representative, state-of-the-art DL library testing approaches. The results reveal that existing approaches generalize poorly to other task domains. We also find that the test inputs generated by these approaches usually lack diversity and expose only a few types of bugs. Worse, the false-positive rate of existing approaches is high (up to 58%). To address these issues, we propose Gandalf, a guided, generation-based differential fuzzing approach. To generate test inputs across diverse task domains effectively, Gandalf adopts a context-free grammar to ensure validity and uses a Deep Q-Network to maximize diversity. Gandalf also includes 15 metamorphic relations that allow the generated test cases to generalize across different DL libraries. This design reduces the false positives caused by semantic differences between the APIs of different libraries. We evaluate the effectiveness of Gandalf on nine versions of three representative DL libraries, covering 309 operators from computer vision, natural language processing, and automated speech recognition. The evaluation results demonstrate that Gandalf can effectively and efficiently generate diverse test inputs. Gandalf also detects five categories of bugs with a false-positive rate of only 3.1%. We have reported all 49 new unique bugs found during the evaluation to the DL libraries’ developers, and most of these bugs have been confirmed. Details about our empirical study and evaluation results are available on our project website.
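To make the idea of grammar-based generation combined with differential testing concrete, here is a minimal sketch in plain Python. It is not Gandalf's implementation. The toy grammar, the operator set, the three "backends", and every function name are hypothetical. Uniform random rule selection stands in for the Deep Q-Network, and no metamorphic relations are modeled. The sketch only shows how a grammar keeps generated operator chains valid and how a tolerance-based comparison flags real divergences without reporting harmless floating-point noise.

```python
import math
import random

# Hypothetical toy grammar: a "model" is a non-empty chain of unary operators.
GRAMMAR = {
    "model": [["op"], ["op", "model"]],
    "op": [["relu"], ["tanh"], ["sigmoid"]],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` into a list of terminal operator names."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal symbol
    rules = GRAMMAR[symbol]
    # Once the depth budget is spent, take the shortest rule so expansion ends.
    rule = min(rules, key=len) if depth >= max_depth else rng.choice(rules)
    chain = []
    for part in rule:
        chain.extend(generate(part, rng, depth + 1, max_depth))
    return chain


# Two stand-in "libraries" computing the same operators with different formulas.
BACKEND_A = {
    "relu": lambda x: max(x, 0.0),
    "tanh": math.tanh,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
}
BACKEND_B = {
    "relu": lambda x: (x + abs(x)) / 2.0,
    "tanh": lambda x: (math.exp(2 * x) - 1.0) / (math.exp(2 * x) + 1.0),
    "sigmoid": lambda x: 0.5 * (1.0 + math.tanh(x / 2.0)),
}
# A copy of backend B with a seeded bug: relu no longer clamps negatives.
BACKEND_BUGGY = dict(BACKEND_B, relu=lambda x: x)


def run(backend, chain, x):
    """Apply every operator in `chain` to `x` using `backend`."""
    for name in chain:
        x = backend[name](x)
    return x


def differential_test(chain, inputs, ref, other, tol=1e-9):
    """Return the inputs on which the two backends disagree beyond `tol`."""
    return [
        x for x in inputs
        if not math.isclose(run(ref, chain, x), run(other, chain, x),
                            rel_tol=tol, abs_tol=tol)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    chain = generate("model", rng)
    inputs = [rng.uniform(-3.0, 3.0) for _ in range(20)]
    print("chain:", chain)
    print("A vs B mismatches:", differential_test(chain, inputs, BACKEND_A, BACKEND_B))
    print("A vs buggy mismatches:",
          differential_test(["relu"], [-1.0, 2.0], BACKEND_A, BACKEND_BUGGY))
```

The tolerance in `differential_test` is the part of this sketch most relevant to false positives: two correct backends rarely agree bit for bit, so an exact equality check would report every rounding difference as a bug.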

Generation-based Differential Fuzzing for Deep Learning Libraries
