GitBug-Java: A Reproducible Benchmark of Recent Java Bugs

André Silva, Nuno Saavedra, Martin Monperrus
2024-02-06
Abstract: Bug-fix benchmarks are essential for evaluating methodologies in automatic program repair (APR) and fault localization (FL). However, existing benchmarks, exemplified by Defects4J, need to evolve to incorporate recent bug-fixes aligned with contemporary development practices. Moreover, reproducibility, a key scientific principle, has been lacking in bug-fix benchmarks. To address these gaps, we present GitBug-Java, a reproducible benchmark of recent Java bugs. GitBug-Java features 199 bugs extracted from the 2023 commit history of 55 notable open-source repositories. The methodology for building GitBug-Java ensures the preservation of bug-fixes in fully-reproducible environments. We publish GitBug-Java at
Software Engineering
What problem does this paper attempt to address?
This paper addresses two key shortcomings of current bug-fix benchmarks:

1. **Bugs are not recent**: Most bugs in existing benchmarks, such as Defects4J, are 10 to 15 years old. This makes those benchmarks less representative when evaluating automatic program repair (APR) and fault localization (FL) techniques against modern software stacks and development practices. Evaluating on old data also risks data leakage, especially for techniques based on large language models (LLMs), because the models may have seen these bugs and their fixes during training.
2. **Bugs are not reproducible**: Reproducibility is crucial for scientific research, as it allows results to be replicated and techniques to be compared systematically in subsequent work. However, existing bug-fix benchmarks struggle to remain reproducible over the long term. For example, Zhu et al. report that the reproducibility rates of existing bug-fix benchmarks range from 26.6% to 96.9%, and that none is fully reproducible.

To address these two challenges, the authors propose GitBug-Java, a reproducible benchmark of 199 recent Java bugs. The bugs come from 55 relevant open-source repositories, and all are taken from the 2023 commit history, which ensures their recency and relevance. GitBug-Java also provides an offline execution environment, so that the bugs and their fixes can still be reproduced even when their dependencies are no longer available online. In this way, GitBug-Java addresses both the recency and the reproducibility problems of existing benchmarks, and provides a high-quality data resource for future research in program repair, fault localization, and related fields.