MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representation

Chao Ni, Liyu Shen, Xiaohu Yang, Yan Zhu, Shaohua Wang
DOI: https://doi.org/10.1145/3643991.3644886
2024-06-18
Abstract: We constructed a new large-scale and comprehensive C/C++ vulnerability dataset named MegaVul by crawling the Common Vulnerabilities and Exposures (CVE) database and CVE-related open-source projects. Specifically, we collected all crawlable descriptive information of the vulnerabilities from the CVE database and extracted all vulnerability-related code changes from 28 Git-based websites. We adopt advanced tools to ensure the integrity of the extracted code and enrich the code with four different transformed representations. In total, MegaVul contains 17,380 vulnerabilities collected from 992 open-source repositories spanning 169 different vulnerability types disclosed from January 2006 to October 2023. Thus, MegaVul can be used for a variety of software security-related tasks, including detecting vulnerabilities and assessing vulnerability severity. All information is stored in JSON format for ease of use. MegaVul is publicly available on GitHub and will be continuously updated. It can be easily extended to other programming languages.
Cryptography and Security, Software Engineering
What problem does this paper attempt to address?
This paper addresses shortcomings in the datasets used for software vulnerability detection. Specifically, existing vulnerability datasets have the following limitations: 1. Unrealistic vulnerabilities (e.g., the SARD dataset is synthetically generated); 2. Unrealistic data distribution (e.g., the Devign dataset is balanced between vulnerable and non-vulnerable samples); 3. Limited diversity (e.g., the ReVeal dataset covers few projects and vulnerability types); 4. Few newly disclosed vulnerabilities (e.g., the Big-Vul dataset only covers up to 2019); 5. Low data quality (e.g., incomplete functions, incorrectly merged functions, missing commit information). To address these issues, the authors constructed a new dataset named MegaVul, which they characterize as high-quality, data-rich, and multi-dimensional. MegaVul crawls all available descriptive information from the public CVE vulnerability database and follows CVE references to the related code commits, from which it extracts the vulnerability-related code changes. Additionally, MegaVul provides several code representations, including abstracted versions and graph representations, to enrich the dataset's content. The dataset includes 17,380 vulnerabilities disclosed from January 2006 to October 2023, covering 169 vulnerability types across 992 open-source repositories. These properties make MegaVul suitable for software security-related tasks such as vulnerability detection and the identification of vulnerability repair patches.
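Because the dataset is distributed as JSON, headline statistics like those above (label distribution, number of vulnerability types, number of repositories) can be recomputed with a few lines of Python. The following is a minimal sketch only: the field names `cve_id`, `repo`, `cwe_ids`, `is_vul`, and `func` are assumptions chosen for illustration and may not match the actual MegaVul schema, and the three sample records (including their CVE identifiers and repository names) are made up.

```python
import json
from collections import Counter

# Made-up sample in a MegaVul-like shape. All field names and values here
# are hypothetical; check the real dataset's schema before reusing this.
SAMPLE_JSON = """
[
  {"cve_id": "CVE-0000-0001", "repo": "example/libfoo",
   "cwe_ids": ["CWE-119"], "is_vul": true,
   "func": "int f(char *s) { char b[4]; strcpy(b, s); return 0; }"},
  {"cve_id": "CVE-0000-0001", "repo": "example/libfoo",
   "cwe_ids": ["CWE-119"], "is_vul": false,
   "func": "int g(void) { return 0; }"},
  {"cve_id": "CVE-0000-0002", "repo": "example/libbar",
   "cwe_ids": ["CWE-416", "CWE-119"], "is_vul": true,
   "func": "void h(int *p) { free(p); *p = 1; }"}
]
"""


def summarize(records):
    """Compute simple dataset statistics from a list of function records."""
    labels = Counter()   # vulnerable vs. non-vulnerable function counts
    cwes = Counter()     # CWE type counts, vulnerable functions only
    repos = set()        # distinct source repositories
    cves = set()         # distinct CVE identifiers
    for rec in records:
        is_vul = bool(rec.get("is_vul"))
        labels["vulnerable" if is_vul else "non_vulnerable"] += 1
        if is_vul:
            cwes.update(rec.get("cwe_ids", []))
        repos.add(rec.get("repo"))
        cves.add(rec.get("cve_id"))
    return {
        "labels": dict(labels),
        "cwe_counts": dict(cwes),
        "num_repos": len(repos),
        "num_cves": len(cves),
    }


records = json.loads(SAMPLE_JSON)
summary = summarize(records)
print(json.dumps(summary, indent=2, sort_keys=True))
```

To run the same summary on a real file, replace `json.loads(SAMPLE_JSON)` with `json.load(open(path))` and adjust the field names to the released schema. Counting CWE types only for vulnerable functions is a design choice of this sketch, made so that fixed or unrelated functions do not inflate the per-type counts.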