An Automated and Flexible Multilingual Bug-Fix Dataset Construction System

Wenkang Zhong, Chuanyi Li, Yunfei Zhang, Ziwen Ge, Jingyu Wang, Jidong Ge, Bin Luo
DOI: https://doi.org/10.1109/ase56229.2023.00176
2024-01-01
Abstract: Developing effective data-driven automated bug-fixing approaches relies heavily on large bug-fix datasets. However, current repository-mined bug-fixing datasets are usually at function-level granularity and lack meta-information such as the fault type. To alleviate the open challenge of precisely mining buggy code snippets, together with their fixes, locations, and types, from open-source repositories, we propose a flexible, extensible, and automated multilingual bug-fix dataset construction system, the Multilingual Bug-Fix Constructor (MBFC). Furthermore, we release a large-scale, fine-grained Multi-lingual Bug-Fix (M-BF) dataset built automatically with the proposed system. Its initial version contains 921,825 bug-fix pairs from 442,164 different open-source software projects, covering January 2020 to September 2020. We expect that our system and dataset will benefit the development of innovative and practical program repair methods, thereby improving the efficiency of program debugging and code review.