StandUp4NPR: Standardizing SetUp for Empirically Comparing Neural Program Repair Systems.
Wenkang Zhong, Hongliang Ge, Hongfei Ai, Chuanyi Li, Kui Liu, Jidong Ge, Bin Luo
DOI: https://doi.org/10.1145/3551349.3556943
2022-01-01
Abstract: A recent trend in automatic program repair is to apply deep neural networks to generate fixed code from buggy code, an approach called Neural Program Repair (NPR). However, existing NPR systems are trained and evaluated under very different settings (e.g., different training data, inconsistent evaluation data, and widely varying candidate numbers), which makes it hard to draw fair conclusions when comparing them. Motivated by this, we first build a standard benchmark dataset and an extensive framework tool to mitigate threats to the comparison. The dataset consists of a training set, a validation set, and an evaluation set containing 144,641, 13,739, and 13,706 Java bug-fix pairs, respectively. The tool supports selecting specific training, validation, and evaluation datasets and automatically running the pipeline of training and evaluating NPR models; new NPR models can be integrated easily by implementing well-defined interfaces. Then, based on the benchmark and tool, we conduct a comprehensive empirical comparison of six state-of-the-art NPR systems with respect to repairability, inclination, and generalizability. The experimental results reveal deeper characteristics of the compared NPR systems and overturn some existing comparative conclusions, which further confirms the need to unify experimental setups when assessing the progress of NPR systems. We also reveal some common features of NPR systems (e.g., they are good at dealing with code-delete bugs). Finally, we identify some promising research directions derived from our findings.