Abstract: Automated program repair (APR) has been gaining ground with substantial effort devoted to the area, opening up many challenges and opportunities. One such challenge is that state-of-the-art repair techniques often resort to incomplete specifications, e.g., test cases that witness buggy behavior, to generate repairs. In practice, bug-exposing test cases are often available when: (1) developers, at the same time as (or after) submitting bug fixes, create tests to ensure the correctness of the fixes, or (2) regression errors occur. The former case – a scenario commonly used for creating popular bug datasets – may, however, not be suitable for assessing how APR performs in the wild. Since developers already know where and how to fix the bugs, tests created in this case may encapsulate knowledge gained only after the bugs are fixed. Thus, more effort is needed to create datasets for evaluating APR more realistically. We address this challenge by creating a dataset focusing on bugs identified via continuous integration (CI) failures – a special case of regression errors – wherein bugs surface when the changed program is re-executed on the existing test suite. We argue that CI failures, in which bug-exposing tests are created before the bug fixes and thus assume no prior developer knowledge of the bugs involved, are more realistic for evaluating APR. Toward this end, we curated 102 CI failures from 40 popular real-world software projects on GitHub. We demonstrate various features and the usefulness of the dataset via an evaluation of five well-known APR techniques, namely GenProg, Kali, Cardumen, RsRepair and Arja. We subsequently discuss several findings and implications for future APR studies. Overall, experimental results show that our dataset is complementary to existing datasets such as Defects4J in realistic evaluations of APR.

TrickyBugs: A Dataset of Corner-case Bugs in Plausible Programs

Learning Likely Invariants to Explain Why a Program Fails

Detecting DBMS Bugs with Context-Sensitive Instantiation and Multi-Plan Execution

The Future Can’t Help Fix the Past: Assessing Program Repair in the Wild

Characterizing Common and Domain-Specific Package Bugs: A Case Study on Ubuntu

PreciseBugCollector: Extensible, Executable and Precise Bug-fix Collection

On the Rise and Fall of Simple Stupid Bugs: a Life-Cycle Analysis of SStuBs

RunBugRun -- An Executable Dataset for Automated Program Repair

BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies

Gdefects4dl: A Dataset of General Real-World Deep Learning Program Defects

HyperPUT: generating synthetic faulty programs to challenge bug-finding tools

How Well Industry-Level Cause Bisection Works in Real-World: A Study on Linux Kernel

BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes

RaceBench: A Triggerable and Observable Concurrency Bug Benchmark

Hunting for bugs in code coverage tools via randomized differential testing

Finding Bug-Inducing Program Environments

Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting

Mining Bug Repositories for Multi-Fault Programs

TSSB-3M: Mining single statement bugs at massive scale

Common Bugs in Scratch Programs

An Empirical Study on TensorFlow Program Bugs