Abstract: Unit testing has become an essential practice in software development and maintenance. Effective unit tests help guard and improve software quality, but they take substantial time and effort to write and maintain. A unit test consists of a test prefix and a test oracle. Synthesizing test oracles, especially functional oracles, is a well-known and challenging problem. Recent studies have proposed leveraging neural models to generate test oracles, i.e., neural test oracle generation (NTOG), and have obtained promising results. However, after a systematic inspection, we find that existing evaluation methods for NTOG rely on several inappropriate settings, which can give a misleading picture of the performance of existing NTOG approaches. We summarize them as 1) generating test prefixes from bug-fixed program versions, 2) evaluating with an unrealistic metric, and 3) lacking a straightforward baseline. In this paper, we first investigate the impact of these settings on evaluating and understanding the performance of NTOG approaches. We find that 1) unrealistically generating test prefixes from bug-fixed program versions inflates the number of bugs found by the state-of-the-art NTOG approach TOGA by 61.8%, 2) FPR (False Positive Rate) is not a realistic evaluation metric, and the Precision of TOGA is only 0.38%, and 3) a straightforward baseline, NoException, which simply expects that no exception is raised, can find 61% of the bugs found by TOGA with twice the Precision. Furthermore, we add a ranking step to existing evaluation methods and propose an evaluation metric named Found@K to better measure the cost-effectiveness of NTOG approaches in terms of bug finding. We propose a novel unsupervised ranking method to instantiate this ranking step, which significantly improves the cost-effectiveness of TOGA. Finally, based on our experimental results and observations, we propose a more realistic evaluation method, TEval+, for NTOG and summarize seven rules of thumb to move NTOG approaches toward practical use.
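To illustrate why a low FPR can coexist with a very low Precision, the following minimal sketch computes both metrics from their standard confusion-matrix definitions. The counts are hypothetical and chosen only for illustration; they are not taken from the paper's experiments, and the function names are our own.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported test failures that correspond to real bugs."""
    reported = tp + fp
    return tp / reported if reported else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of tests on correct behavior that fail spuriously."""
    negatives = fp + tn
    return fp / negatives if negatives else 0.0


# Hypothetical counts (assumption, not data from the paper):
# 4 failures reveal real bugs, 996 failures are false alarms,
# and 99,000 tests on correct behavior pass as they should.
tp, fp, tn = 4, 996, 99_000

print(f"FPR       = {false_positive_rate(fp, tn):.2%}")
print(f"Precision = {precision(tp, fp):.2%}")
```

When tests that exercise correct behavior vastly outnumber bug-revealing ones, the denominator of FPR is dominated by true negatives, so FPR stays small even if nearly every reported failure is a false alarm. Precision instead reflects the share of reported failures that a developer would find worth inspecting.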

Towards More Realistic Evaluation for Neural Test Oracle Generation
