Abstract: We present a comprehensive dataset of Java vulnerability-fixing commits (VFCs) to advance research in Java vulnerability analysis. Our dataset, derived from thousands of open-source Java projects on GitHub, comprises two variants: JavaVFC and JavaVFC-extended. The dataset was constructed through a rigorous process involving heuristic rules and multiple rounds of manual labeling. We initially used keywords to filter candidate VFCs based on commit messages, then refined this keyword set through iterative manual labeling. The final labeling round achieved a precision score of 0.7 among three annotators. We applied the refined keyword set to 34,321 open-source Java repositories with over 50 GitHub stars, resulting in JavaVFC with 784 manually verified VFCs and JavaVFC-extended with 16,837 automatically identified VFCs. Both variants are presented in a standardized JSONL format for easy access and analysis. This dataset supports various research endeavors, including VFC identification, fine-grained vulnerability detection, and automated vulnerability repair. JavaVFC and JavaVFC-extended are publicly available at https://zenodo.org/records/13731781.
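
Because both variants are distributed as JSONL (one JSON object per line), they can be read with standard tooling. The following is a minimal sketch of such a reader, not the authors' loader. The field names `repo`, `sha`, and `message` are hypothetical placeholders, since the abstract does not describe the record schema.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL text stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc


# Hypothetical records for illustration only; the real schema may differ.
sample = io.StringIO(
    '{"repo": "example/app", "sha": "abc123", "message": "Fix XXE in XML parser"}\n'
    "\n"
    '{"repo": "example/lib", "sha": "def456", "message": "Prevent path traversal"}\n'
)

records = list(read_jsonl(sample))
for record in records:
    print(record["repo"], record["sha"], record["message"])
```

For a file on disk, the same function could be called as `read_jsonl(open(path, encoding="utf-8"))`, where `path` points at a downloaded JSONL file.
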

What problem does this paper attempt to address?

The paper addresses shortcomings in existing datasets of Java vulnerability-fixing commits (VFCs). The authors point out three limitations of current datasets:

1. **Insufficient Quantity and Diversity**: Many existing datasets rely on sources such as the National Vulnerability Database (NVD) or CVEDetails. These sources are reliable, but the datasets built on them are limited in size and diversity.
2. **Incomplete Language Coverage**: Most existing vulnerability datasets focus on C/C++ projects. Java is widely used in large-scale enterprise applications, yet it is comparatively underserved.
3. **Limited Scope**: The vulnerability datasets that do exist for Java cover relatively few projects and relatively few vulnerability-fixing commits, so they cannot fully represent real-world code.

To address these problems, the authors construct a new, high-quality Java VFC dataset that aims to cover a larger and more diverse set of vulnerability-fixing commits. With this dataset, researchers can better develop and evaluate automated vulnerability detection and repair tools, and thereby improve software security and stability.

### Main Contributions

1. **Dataset Construction**:
   - **JavaVFC**: A high-precision dataset containing 784 VFCs, each verified by at least two annotators.
   - **JavaVFC-extended**: A larger dataset containing 16,837 VFCs selected by heuristic rules from 34,321 open-source Java projects.
2. **Keyword Set**: A curated set of keywords for efficiently screening VFCs from commit messages. The keyword set supports the current study and can be extended in future work.

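
The keyword-based screening of commit messages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The `KEYWORDS` list is hypothetical and is not the paper's curated set, and the whole-word, case-insensitive matching rule is an assumption.

```python
import re

# Hypothetical keywords for illustration; NOT the paper's curated keyword set.
KEYWORDS = ["vulnerability", "CVE", "XSS", "injection", "security fix"]


def build_pattern(keywords):
    """Compile one case-insensitive regex that matches any keyword as a whole word."""
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


PATTERN = build_pattern(KEYWORDS)


def is_candidate_vfc(message):
    """Return True if the commit message contains at least one keyword."""
    return PATTERN.search(message) is not None


messages = [
    "Fix SQL injection in login query",
    "Update README badges",
    "CVE-2021-1234: sanitize user input",
    "Refactor dependency injection container",  # keyword match, but not a security fix
]

candidates = [m for m in messages if is_candidate_vfc(m)]
print(candidates)
```

The last message is flagged even though it has nothing to do with security, which is the kind of false positive that makes a manual verification step useful after keyword filtering.
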
### Application of the Dataset

The dataset supports multiple research directions, including but not limited to:

- **VFC Detection**: Identify commits that fix vulnerabilities, helping developers discover potential vulnerabilities in the continuous integration pipeline.
- **Vulnerability Detection**: Extract vulnerability information at different granularities, such as file-level or function-level code, to support more in-depth vulnerability analysis.
- **Vulnerability Repair**: Show how developers repair vulnerable code, providing a reference for automated repair tools.
- **Empirical Research**: Analyze Common Weakness Enumeration (CWE) categories, provide insights into common security problems, and guide best practices for secure coding.

### Threats and Limitations

The authors also discuss threats to the internal, construct, and external validity of the dataset. For example, a keyword-based search may miss commits whose messages do not contain any of the selected keywords, and it may include commits that match a keyword but are not vulnerability fixes. In addition, the dataset is limited to Java projects, so findings based on it may not generalize to other programming languages.

In conclusion, the paper fills gaps in existing datasets by constructing a high-quality Java VFC dataset, providing a valuable resource for future research on vulnerability detection and repair.
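
The two keyword-related threats mentioned above (missed commits and wrongly included commits) correspond to recall and precision. The sketch below shows how both could be computed for a keyword filter against manual labels. The commit identifiers and labels are invented for illustration, and this is not necessarily how the paper computes its reported precision score.

```python
def precision_recall(flagged, actual):
    """Compute precision and recall of a filter.

    flagged: commit ids selected by the keyword filter
    actual:  commit ids that manual labeling confirms as VFCs
    """
    flagged, actual = set(flagged), set(actual)
    true_positives = len(flagged & actual)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical data: c4 matches a keyword but is not a fix (false positive);
# c5 is a fix whose message contains no keyword (false negative).
flagged = {"c1", "c2", "c3", "c4"}
actual = {"c1", "c2", "c3", "c5"}

precision, recall = precision_recall(flagged, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Measuring recall requires labeling commits that the filter did not select, which is why keyword-based datasets typically report precision more readily than recall.
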

JavaVFC: Java Vulnerability Fixing Commits from Open-source Software

MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representations.

Understanding Vulnerability Inducing Commits of the Linux Kernel

A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software

V-SZZ: Automatic Identification of Version Ranges Affected by CVE Vulnerabilities

Patchmatch: A Tool for Locating Patches of Open Source Project Vulnerabilities

Fine-grained Commit-level Vulnerability Type Prediction by CWE Tree Structure.

VFCFinder: Seamlessly Pairing Security Advisories and Patches

CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics

ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software

VulZoo: A Comprehensive Vulnerability Intelligence Dataset

A ground-truth dataset of real security patches

JFinder: A Novel Architecture for Java Vulnerability Identification Based Quad Self-Attention and Pre-training Mechanism

Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories

A Quantitative Study of Security Bug Fixes of GitHub Repositories

How Effective Are Neural Networks for Fixing Security Vulnerabilities

VFFINDER: A Graph-based Approach for Automated Silent Vulnerability-Fix Identification

VCIPR: Vulnerable Code is Identifiable When a Patch is Released (Hacker's Perspective)

VulCurator: A Vulnerability-Fixing Commit Detector

SPI: Automated Identification of Security Patches via Commits