Abstract: Context: Researchers testing hypotheses about the factors that lead to low-quality software often rely on historical data, specifically on details of when defects were introduced into a codebase of interest. The prevailing techniques for determining when defects were introduced are variants of the SZZ algorithm. This algorithm takes the lines modified in a bug-fixing commit and finds when those lines were last modified, thereby identifying bug-introducing commits. Objectives: Despite several improvements and variants, SZZ struggles with accuracy, especially when commits in the version control system contain unrelated modifications or touch files not involved in the introduction of the bug (known as tangled commits and ghost commits). Methods: Our research investigates whether and how content retrieved from bug discussions can address these issues by identifying the related and external files, and thus improve the efficacy of the SZZ algorithm. Results: To conduct our investigation, we take advantage of the links that Mozilla developers manually insert in bug reports to signal which commits introduced bugs. Using these links, we prepared a dataset, RoTEB, comprising 12,472 bug reports. We first manually inspect a sample of 369 bug reports related to these bug-fixing or bug-introducing commits and investigate whether the files mentioned in these reports could be useful for SZZ. After finding evidence that the mentioned files are relevant, we augment SZZ with this information, using different strategies, and evaluate the resulting approach against multiple SZZ variations. Conclusion: We define a taxonomy outlining the rationale behind developers' references to diverse files in their discussions. We observe that bug discussions often mention files that are relevant to improving the SZZ algorithm's efficacy. We then verify that integrating these file references improves the precision of SZZ in pinpointing bug-introducing commits, although it does not markedly influence recall. These results deepen our understanding of the usefulness of bug discussions for SZZ. Future work can leverage our dataset and explore other techniques to further address the problem of tangled commits and ghost commits. Data & material: https://zenodo.org/records/11484723.
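
The core SZZ step described in the abstract, and one possible way of using files mentioned in a bug discussion, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Commit` class, the `last_modifier` and `szz_candidates` helpers, and the toy history are hypothetical, line numbers are treated as stable across commits (real SZZ implementations rely on blame or annotate output instead), and the file filter is just one conceivable strategy, not necessarily any of those evaluated in the paper.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    sha: str
    # Maps a file path to the set of line numbers this commit modified.
    # Toy assumption: line numbers never shift between commits.
    touched: dict


def last_modifier(history, before_index, path, line):
    """Return the sha of the latest commit before `before_index`
    that modified (path, line), or None if there is none."""
    for commit in reversed(history[:before_index]):
        if line in commit.touched.get(path, ()):
            return commit.sha
    return None


def szz_candidates(history, fix_sha, mentioned_files=None):
    """Blame every line changed by the fix and collect the commits
    that last modified those lines.

    If `mentioned_files` is given (hypothetically, the files named in
    the bug discussion), lines in other files are skipped.
    """
    fix_index = next(i for i, c in enumerate(history) if c.sha == fix_sha)
    candidates = set()
    for path, lines in history[fix_index].touched.items():
        if mentioned_files is not None and path not in mentioned_files:
            continue  # file not mentioned in the discussion: ignore it
        for line in lines:
            sha = last_modifier(history, fix_index, path, line)
            if sha is not None:
                candidates.add(sha)
    return candidates


# Toy history, oldest first. The fix also edits README.md, an
# unrelated change of the kind found in a tangled commit.
history = [
    Commit("a1", {"parser.c": {10, 11}}),
    Commit("b2", {"README.md": {1}}),
    Commit("c3", {"parser.c": {11}, "util.c": {5}}),
    Commit("fix", {"parser.c": {11}, "README.md": {1}}),
]

plain = szz_candidates(history, "fix")
filtered = szz_candidates(history, "fix", mentioned_files={"parser.c"})
print(sorted(plain), sorted(filtered))
```

In this toy history the unfiltered run also blames the commit that last edited `README.md`, while restricting the blame to the mentioned file drops that candidate. This shows how such a filter could raise precision without necessarily changing recall.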

Multi-extract and Multi-level Dataset of Mozilla Issue Tracking History

A Multi-level Dataset of Linux Kernel Patchwork

RegMiner: Towards Constructing a Large Regression Dataset from Code Evolution History

Mining Bug Repositories for Multi-Fault Programs

Issue Workflow Explorer

Mining Issue Trackers: Concepts and Techniques

On Refining the SZZ Algorithm with Bug Discussion Data

PreciseBugCollector: Extensible, Executable and Precise Bug-fix Collection

A New MI-Based Visualization Aided Validation Index for Mining Big Longitudinal Web Trial Data

RegMiner: Mining Replicable Regression Dataset from Code Repositories.

COVID-Scraper: An Open-Source Toolset for Automatically Scraping and Processing Global Multi-Scale Spatiotemporal COVID-19 Records

BugMiner: Automating Precise Bug Dataset Construction by Code Evolution History Mining

Be Careful of When: an Empirical Study on Time-Related Misuse of Issue Tracking Data.

Fingerprinting and Building Large Reproducible Datasets

npm-follower: A Complete Dataset Tracking the NPM Ecosystem

A Study of the Extraction of Bug Judgment and Correction Times from Open Source Software Bug Logs.

PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software

A Method To Identify And Correct Problematic Software Activity Data: Exploiting Capacity Constraints And Data Redundancies

Practice Evolution Explorer

Tiangong-St: A New Dataset With Large-Scale Refined Real-World Web Search Sessions