Abstract: Deep learning (DL) libraries have become a key component in developing and deploying DL-based software. With DL models increasingly applied in both academia and industry across various domains, any bug inherent in a DL library can cause unexpected and severe outcomes. There is therefore an urgent demand for improving the software quality of DL libraries. Although some existing approaches are specifically designed for testing DL libraries, their focus is usually limited to one specific domain, such as computer vision (CV). It remains unclear how well, and to what extent, existing approaches detect bugs in different DL libraries across different task domains. To bridge this gap, we first conduct an empirical study on four representative, state-of-the-art DL library testing approaches. The results reveal that existing approaches generalize poorly to other task domains. We also find that the test inputs generated by these approaches usually lack diversity and expose only a few types of bugs. Worse, the false-positive rate of existing approaches is high (up to 58%). To address these issues, we propose Gandalf, a guided, generation-based differential fuzzing approach. To generate test inputs across diverse task domains effectively, Gandalf adopts a context-free grammar to ensure validity and uses a Deep Q-Network to maximize diversity. Gandalf also includes 15 metamorphic relations so that the generated test cases generalize across different DL libraries. This design reduces the false positives caused by semantic differences between APIs. We evaluate the effectiveness of Gandalf on nine versions of three representative DL libraries, covering 309 operators from computer vision, natural language processing, and automated speech recognition. The evaluation results demonstrate that Gandalf can effectively and efficiently generate diverse test inputs. Gandalf also detects five categories of bugs with a false-positive rate of only 3.1%. We reported all 49 new unique bugs found during the evaluation to the DL libraries' developers, and most of these bugs have been confirmed. Details about our empirical study and evaluation results are available on our project website.
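The general idea of grammar-based generation combined with differential testing can be illustrated with a minimal sketch. It is not Gandalf's implementation: the toy grammar, the operator names, and the two pure-Python "backends" (`LIB_A`, `LIB_B`) are all hypothetical, and uniform random rule selection stands in for the Deep Q-Network that the abstract describes.

```python
import math
import random

# Hypothetical toy grammar: a "model" is a non-empty chain of layers.
GRAMMAR = {
    "model": [["layer"], ["layer", "model"]],
    "layer": [["relu"], ["softmax"], ["scale"]],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` into a list of terminal operator names."""
    if symbol not in GRAMMAR:
        return [symbol]
    rules = GRAMMAR[symbol]
    # Past max_depth, take the first (non-recursive) rule so expansion terminates.
    # Assumption: uniform random choice replaces the learned policy.
    rule = rules[0] if depth >= max_depth else rng.choice(rules)
    ops = []
    for part in rule:
        ops.extend(generate(part, rng, depth + 1, max_depth))
    return ops


def softmax_naive(xs):
    # Overflows for large inputs because exp() is applied directly.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_stable(xs):
    # Subtracting the maximum keeps exp() within range.
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relu(xs):
    return [max(0.0, x) for x in xs]


def scale(xs):
    return [0.5 * x for x in xs]


# Two hypothetical backends that should be semantically equivalent.
LIB_A = {"relu": relu, "softmax": softmax_naive, "scale": scale}
LIB_B = {"relu": relu, "softmax": softmax_stable, "scale": scale}


def run(lib, ops, xs):
    """Apply the operator chain `ops` to `xs` using backend `lib`."""
    for op in ops:
        xs = lib[op](xs)
    return xs


def differential_check(ops, xs, tol=1e-9):
    """Return True when both backends agree on `ops` applied to `xs`."""
    try:
        out_a = run(LIB_A, ops, xs)
        out_b = run(LIB_B, ops, xs)
    except OverflowError:
        # A crash in either backend is reported as a discrepancy.
        return False
    return all(
        math.isclose(a, b, rel_tol=tol, abs_tol=tol)
        for a, b in zip(out_a, out_b)
    )
```

Calling `generate("model", random.Random(0))` yields a valid operator chain by construction, and `differential_check(["softmax"], [1000.0, 0.0])` flags the overflow in the naive backend as a disagreement between the two.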
