AW4C: A Commit-Aware C Dataset for Actionable Warning Identification

Zhipeng Liu,Meng Yan,Zhipeng Gao,Dong Li,Xiaohong Zhang,Dan Yang
DOI: https://doi.org/10.1145/3643991.3644885
2024-01-01
Abstract:Excessive non-actionable warnings generated by static program analysis tools can hinder developers from utilizing these tools effectively. Leveraging learning-based approaches for actionable warning identification has demonstrated promise in boosting developer productivity, minimizing the risk of bugs, and reducing code smells. However, the small sizes of existing datasets have limited the model choices for machine learning researchers, and the lack of aligned fix commits limits the scope of the dataset for research. In this paper, we present AW4C, an actionable warning C dataset that contains 38,134 actionable warnings mined from more than 500 repositories on GitHub. These warnings are generated via Cppcheck, and most importantly, each warning is precisely mapped to the commit where the corrective action occurred. To the best of our knowledge, this is the largest publicly available actionable warning dataset for C programming language to date. The dataset is suited for use in machine/deep learning models and can support a wide range of tasks, such as actionable warning identification and vulnerability detection. Furthermore, we have released our dataset1 and a general framework for collecting actionable warnings on GitHub2 to facilitate other researchers to replicate our work and validate their innovative ideas.
What problem does this paper attempt to address?