Abstract: Deep learning (DL) libraries have become a key component in developing and deploying DL-based software. With the growing popularity of DL models in both academia and industry across various domains, any bug inherent in a DL library can potentially cause unexpected, severe outcomes. As such, there is an urgent demand for improving the software quality of DL libraries. Although some existing approaches are specifically designed for testing DL libraries, their focus is usually limited to one specific domain, such as computer vision (CV). It remains unclear how well, and to what extent, existing approaches detect bugs in different DL libraries across different task domains. To bridge this gap, we first conduct an empirical study on four representative, state-of-the-art DL library testing approaches. Our empirical study reveals that existing approaches generalize poorly to other task domains. We also find that the test inputs generated by these approaches usually lack diversity and trigger only a few types of bugs. What is worse, the false-positive rate of existing approaches is high (up to 58%). To address these issues, we propose a guided, generation-based differential fuzzing approach, namely, Gandalf. To generate test inputs across diverse task domains effectively, Gandalf uses a context-free grammar to ensure validity and a Deep Q-Network to maximize diversity. Gandalf also includes 15 metamorphic relations that allow the generated test cases to generalize across different DL libraries. This design reduces the false positives caused by semantic differences between the APIs of different libraries. We evaluate the effectiveness of Gandalf on nine versions of three representative DL libraries, covering 309 operators from computer vision, natural language processing, and automated speech recognition. The evaluation results demonstrate that Gandalf can effectively and efficiently generate diverse test inputs. Meanwhile, Gandalf successfully detects five categories of bugs with a false-positive rate of only 3.1%. We have reported all 49 new unique bugs found during the evaluation to the DL libraries' developers, and most of these bugs have been confirmed. Details about our empirical study and evaluation results are available on our project website.
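
To make the idea of grammar-based generation combined with a differential check concrete, the following is a minimal sketch and not the Gandalf implementation. The toy grammar, the operator names (`relu`, `scale`, `shift`), and the two pure-Python "backends" are all hypothetical stand-ins for real DL libraries, and a uniform random choice takes the place of the Deep Q-Network that guides production selection in the approach described above.

```python
import math
import random

# Hypothetical toy grammar: each nonterminal maps to its list of productions.
GRAMMAR = {
    "model": [["layer"], ["layer", "model"]],
    "layer": [["relu"], ["scale"], ["shift"]],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` into a list of terminal operator names."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal symbol
    productions = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the shortest production so expansion ends.
        production = min(productions, key=len)
    else:
        # Assumption: random choice stands in for a learned selection policy.
        production = rng.choice(productions)
    ops = []
    for child in production:
        ops.extend(generate(child, rng, depth + 1, max_depth))
    return ops


def run_backend_a(ops, xs):
    """Hypothetical reference backend."""
    for op in ops:
        if op == "relu":
            xs = [max(x, 0.0) for x in xs]
        elif op == "scale":
            xs = [2.0 * x for x in xs]
        elif op == "shift":
            xs = [x + 1.0 for x in xs]
    return xs


def run_backend_b(ops, xs):
    """Hypothetical second backend: same semantics, different formulas."""
    for op in ops:
        if op == "relu":
            xs = [(x + abs(x)) / 2.0 for x in xs]
        elif op == "scale":
            xs = [x + x for x in xs]
        elif op == "shift":
            xs = [1.0 + x for x in xs]
    return xs


def outputs_agree(ops, xs, backend_a, backend_b, tol=1e-9):
    """Differential check: both backends must agree within a tolerance."""
    ys_a, ys_b = backend_a(ops, xs), backend_b(ops, xs)
    return all(
        math.isclose(a, b, rel_tol=tol, abs_tol=tol) for a, b in zip(ys_a, ys_b)
    )
```

A typical use would generate a sequence with `generate("model", random.Random(0))` and flag it as a candidate bug whenever `outputs_agree` returns `False`. Because every sequence is derived from the grammar, it is valid by construction, which is the property the context-free grammar is meant to provide.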

Generation-based Differential Fuzzing for Deep Learning Libraries
