Abstract: Deep Learning (DL) libraries, such as PyTorch, are widely used for building and deploying DL models on various hardware platforms. However, they have been found to contain bugs that produce incorrect calculation results and cause issues such as non-convergent training and inaccurate predictions in DL models. Many efforts have therefore been made to test DL libraries and reveal bugs. Existing DL library testing methods nevertheless have limitations. Model-level testing methods complicate fault localization. API-level testing methods often generate invalid inputs or focus primarily on extreme inputs that lead to crash failures; they also neglect realistic API interactions. These limitations may cause bugs to go undetected, even in frequently used APIs. To address these limitations, we propose SORT (Subgraph-Oriented Realistic Testing) to differentially test DL libraries on different hardware platforms. SORT takes popular API interaction patterns, represented as frequent subgraphs of model computation graphs, as test subjects. In this way, it introduces realistic API interaction sequences while remaining efficient at locating the faulty APIs behind observed errors. In addition, SORT prepares test inputs by referring to extensive features of the runtime inputs that each API receives when executing real-life benchmark data. The generated inputs are expected to better simulate such valid real inputs and to reveal bugs that are more likely to occur in real-life usage. Evaluation on 728 frequent subgraphs of 49 popular PyTorch models demonstrates that SORT achieves a 100% valid input generation rate, detects more precision bugs than existing methods, and reveals interaction-related bugs missed by single-API testing. In total, 18 precision bugs in PyTorch are identified.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses limitations in the testing of deep learning (DL) libraries (such as PyTorch) that make it difficult to find and fix hidden bugs. Specifically, existing methods have the following deficiencies:

1. **Lack of effective input generation**:
   - The inputs generated by existing API-level testing methods often fail the API validity check (for example, because of incorrect tensor shapes), so only a low proportion of generated inputs is valid.
   - Even when inputs pass the validity check, they are usually boundary or extreme values, which mainly trigger crash failures and leave precision bugs unexamined. Precision bugs have a significant impact on the reliability of high-performance computing and deep learning models.
2. **Lack of the ability to detect precision bugs**:
   - Most inputs generated by existing methods are extreme cases that lead to crashes, whereas such cases occur less frequently in actual deployment scenarios.
   - Precision bugs (that is, a loss of precision between the calculated result and the correct result) are a key issue in high-precision computing, but existing methods rarely identify them.
3. **Lack of real-world API interaction testing**:
   - Traditional API-level testing methods test each API independently and cannot detect errors caused by interactions between APIs.
   - Some methods combine APIs artificially for testing, but these combinations do not necessarily reflect real-world API interaction patterns, so the errors they discover are unlikely to occur in actual use.

To solve these problems, the paper proposes a new DL library testing method, SORT (Subgraph-Oriented Realistic Testing). SORT improves on existing testing methods in the following ways:

- **Introducing frequent subgraphs as test subjects**: Frequent subgraphs of model computation graphs reflect common API interaction patterns in real-world use while keeping fault localization efficient.
- **Generating test inputs based on runtime input characteristics**: By recording the features of the inputs each API receives when real-world models are executed, SORT generates valid inputs that are closer to actual use.
- **Differential testing**: Tests are executed on different hardware platforms (such as CPU and GPU) to reveal precision bugs and other discrepancies.

Through these improvements, SORT can discover more precision bugs in more realistic scenarios and improve the effectiveness and reliability of testing.
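The three ideas above can be illustrated with a small, self-contained sketch. It is not SORT's implementation, and everything in it is an assumption made for illustration: the `RECORDED` feature dictionary, the helper names, the choice of an `exp -> sum -> div` subgraph, and the tolerance values are all hypothetical. The paper compares PyTorch on real hardware platforms; here two pure-Python executions at different precisions (float64 and emulated float32) stand in for the two platforms.

```python
import math
import random
import struct

# Hypothetical runtime features recorded for one API input
# (value range and length); these numbers are illustrative only.
RECORDED = {"low": -4.0, "high": 4.0, "length": 16}


def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sample_input(features, rng):
    """Draw an input that stays inside the recorded value range."""
    return [rng.uniform(features["low"], features["high"])
            for _ in range(features["length"])]


def subgraph_reference(xs):
    """A small API interaction pattern (exp -> sum -> div) in float64."""
    exps = [math.exp(x) for x in xs]
    total = math.fsum(exps)
    return [e / total for e in exps]


def subgraph_low_precision(xs):
    """The same pattern with every intermediate rounded to float32,
    standing in for a second platform."""
    exps = [to_f32(math.exp(to_f32(x))) for x in xs]
    total = 0.0
    for e in exps:
        total = to_f32(total + e)
    return [to_f32(e / total) for e in exps]


def max_rel_diff(a, b, eps=1e-12):
    """Largest element-wise relative difference between two outputs."""
    return max(abs(x - y) / max(abs(x), abs(y), eps) for x, y in zip(a, b))


def differential_test(n_trials=100, tol=1e-4, seed=0):
    """Run both executions on sampled inputs and collect the trials
    whose outputs deviate by more than `tol`."""
    rng = random.Random(seed)
    failures = []
    for trial in range(n_trials):
        xs = sample_input(RECORDED, rng)
        diff = max_rel_diff(subgraph_reference(xs), subgraph_low_precision(xs))
        if diff > tol:
            failures.append((trial, diff))
    return failures


if __name__ == "__main__":
    print("deviations above 1e-4:", len(differential_test(tol=1e-4)))
    print("deviations above 1e-9:", len(differential_test(tol=1e-9)))
```

The tolerance decides what counts as a precision discrepancy. A loose threshold accepts ordinary float32 rounding, while a very strict one flags every trial, so a practical setup would need a threshold suited to the data type under test. Because the test subject is a short subgraph and not a whole model, a flagged deviation points to only a handful of candidate APIs.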

Subgraph-Oriented Testing for Deep Learning Libraries
