JavaVFC: Java Vulnerability Fixing Commits from Open-source Software

Tan Bui, Yan Naing Tun, Yiran Cheng, Ivana Clairine Irsan, Ting Zhang, Hong Jin Kang
DOI: https://doi.org/10.48550/arXiv.2409.05576
2024-09-09
Abstract: We present a comprehensive dataset of Java vulnerability-fixing commits (VFCs) to advance research in Java vulnerability analysis. Our dataset, derived from thousands of open-source Java projects on GitHub, comprises two variants: JavaVFC and JavaVFC-extended. The dataset was constructed through a rigorous process involving heuristic rules and multiple rounds of manual labeling. We initially used keywords to filter candidate VFCs based on commit messages, then refined this keyword set through iterative manual labeling. The final labeling round achieved a precision score of 0.7 among three annotators. We applied the refined keyword set to 34,321 open-source Java repositories with over 50 GitHub stars, resulting in JavaVFC with 784 manually verified VFCs and JavaVFC-extended with 16,837 automatically identified VFCs. Both variants are presented in a standardized JSONL format for easy access and analysis. This dataset supports various research endeavors, including VFC identification, fine-grained vulnerability detection, and automated vulnerability repair. JavaVFC and JavaVFC-extended are publicly available at https://zenodo.org/records/13731781.
Software Engineering
What problem does this paper attempt to address?
The main problem this paper addresses is the shortcomings of existing datasets of Java vulnerability-fixing commits (VFCs). Specifically, the authors point out the following limitations:

1. **Insufficient Quantity and Diversity**: Many existing datasets rely on sources such as the National Vulnerability Database (NVD) or CVEDetails. These sources are reliable, but the datasets built from them are often limited in quantity and diversity.
2. **Incomplete Language Coverage**: Most existing vulnerability datasets focus on C/C++ projects. Java, although widely used in large-scale enterprise applications, is clearly underrepresented.
3. **Limited Scope**: The vulnerability datasets that do exist for Java cover relatively few projects and few vulnerability-fixing commits, so they cannot fully represent real-world conditions.

To address these problems, the authors construct a new, high-quality Java VFC dataset that covers a larger and more diverse set of vulnerability-fixing commits. With this dataset, researchers can better develop and evaluate automated vulnerability detection and repair tools, thereby improving the security and stability of software.

### Main Contributions

1. **Dataset Construction**:
   - **JavaVFC**: A high-precision dataset containing 784 VFCs verified by at least two annotators.
   - **JavaVFC-extended**: A larger-scale dataset containing 16,837 VFCs selected by heuristic rules from 34,321 open-source Java projects.
2. **Keyword Set**: A carefully curated set of keywords for efficiently screening VFCs from commit messages. The keyword set serves the current study and can be extended in future work.
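
The keyword-based screening of commit messages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the keyword list, the function names, and the `message` field are all hypothetical, and the paper's curated keyword set is not reproduced here.

```python
import re

# Hypothetical keywords for illustration only; NOT the paper's curated set.
SECURITY_KEYWORDS = ["vulnerab", "cve-", "xss", "sql injection", "overflow", "security fix"]

# One case-insensitive pattern built from the escaped keywords.
KEYWORD_PATTERN = re.compile(
    "|".join(re.escape(k) for k in SECURITY_KEYWORDS), re.IGNORECASE
)


def is_candidate_vfc(commit_message: str) -> bool:
    """Return True if the commit message mentions any security keyword."""
    return KEYWORD_PATTERN.search(commit_message) is not None


def filter_candidates(commits):
    """Keep commits (dicts with an assumed 'message' key) that match a keyword."""
    return [c for c in commits if is_candidate_vfc(c["message"])]
```

Commits that pass such a filter are only candidates. In the paper, candidates were further checked through manual labeling, and the keyword set was refined between labeling rounds.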
### Application of the Dataset

This dataset supports multiple research directions, including but not limited to:

- **VFC Detection**: Identify commits that fix vulnerabilities, helping developers discover potential vulnerabilities in the continuous integration pipeline.
- **Vulnerability Detection**: Extract vulnerability information at different granularities, such as file-level or function-level code, to support more in-depth vulnerability analysis.
- **Vulnerability Repair**: Show how developers repair vulnerable code, providing a reference for automated repair tools.
- **Empirical Research**: Analyze Common Weakness Enumeration (CWE) categories, provide insights into common security problems, and guide best practices for secure coding.

### Threats and Limitations

The authors also discuss threats to the dataset's internal, construct, and external validity. For example, a keyword-based search may miss commits that do not explicitly use the selected keywords, and it may wrongly include commits that match a keyword but are unrelated to any vulnerability. In addition, the dataset is limited to Java projects, which may limit how well findings generalize to other programming languages.

In conclusion, this paper fills gaps in existing datasets by constructing a high-quality Java VFC dataset, providing a valuable resource for future research on vulnerability detection and repair.
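
The two failure modes of keyword matching described under Threats and Limitations (missed fixes and irrelevant matches) correspond to recall and precision. The sketch below measures both on a toy example. The commit messages, labels, and keywords are invented for illustration and do not come from the dataset.

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) from two parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)        # matched and truly a fix
    fp = sum(p and not a for p, a in pairs)    # matched but irrelevant
    fn = sum(a and not p for p, a in pairs)    # real fix that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up commit messages with hand-assigned labels (True = real VFC).
samples = [
    ("Fix XSS vulnerability in comment rendering", True),  # matched, real fix
    ("Add security section to README", False),             # matched, irrelevant
    ("Sanitize user input before building query", True),   # real fix, no keyword
    ("Refactor logging module", False),                    # correctly ignored
]

# Hypothetical keywords; not the paper's curated set.
KEYWORDS = ("xss", "vulnerab", "security")

predicted = [any(k in msg.lower() for k in KEYWORDS) for msg, _ in samples]
actual = [label for _, label in samples]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy set the README commit is a false positive and the input-sanitizing commit is a miss, so both precision and recall come out at 0.5.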