Abstract: The changesets (or patches) that fix open source software vulnerabilities form critical datasets for various machine learning security-enhancing applications, such as automated vulnerability patching and silent fix detection. These patch datasets are derived from extensive collections of historical vulnerability fixes, maintained in databases like the Common Vulnerabilities and Exposures list and the National Vulnerability Database. However, since these databases focus on rapid notification to the security community, they contain significant inaccuracies and omissions that have a negative impact on downstream software security quality assurance tasks. In this paper, we propose an approach employing Uncertainty Quantification (UQ) to curate datasets of publicly available software vulnerability patches. Our methodology leverages machine learning models that incorporate UQ to differentiate between patches based on their potential utility. We begin by evaluating a number of popular UQ techniques, including Vanilla, Monte Carlo Dropout, and Model Ensemble, as well as homoscedastic and heteroscedastic models of noise. Our findings indicate that Model Ensemble and heteroscedastic models are the best choices for vulnerability patch datasets. Based on these UQ modeling choices, we propose a heuristic that uses UQ to filter out lower quality instances and select instances with high utility value from the vulnerability dataset. Using our approach, we observe an improvement in predictive performance and significant reduction of model training time (i.e., energy consumption) for a state-of-the-art vulnerability prediction model.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses quality problems in publicly available datasets of software vulnerability patches. Specifically, the authors focus on improving the quality and utility of these datasets by introducing Uncertainty Quantification (UQ) techniques. The paper attempts to solve the following problems:

1. **Quality problems of existing datasets**:
   - Databases that record fixes for open-source software vulnerabilities (such as the CVE list and the NVD) prioritize rapid notification of the security community, and as a result contain significant inaccuracies and omissions. These problems have a negative impact on downstream software security quality assurance tasks.
   - Vulnerability patch datasets used in existing research often rely on manual verification or heuristic-based automated methods, which are difficult to scale and prone to introducing new errors.
2. **Distinction between data quality and utility**:
   - The paper points out that data quality and data utility are related but distinct concepts. Existing work usually lacks a systematic method for evaluating both the quality and the utility of vulnerability patch data.
3. **Improving automated data curation capabilities**:
   - The paper proposes a UQ-based method that automatically selects vulnerability patch data of high quality and high utility. This reduces the manual verification workload and improves the predictive performance and computational efficiency of machine learning models.

### Method overview

To solve the above problems, the paper proposes the following methods:

- **Introducing Uncertainty Quantification (UQ)**: Use UQ techniques to evaluate and filter instances in the vulnerability patch dataset, thereby improving the quality and utility of the dataset.
- **Comparing different UQ techniques**: Evaluate Vanilla, Monte Carlo Dropout, and Model Ensemble, together with homoscedastic and heteroscedastic models of noise, on vulnerability patch datasets.
- **Designing an algorithm**: Propose a UQ-based algorithm that uses epistemic uncertainty and data (aleatoric) uncertainty to select vulnerability patch data of high quality and high utility.

### Main contributions

1. **Propose a UQ-based, automated, and systematic algorithm for curating vulnerability patch data**, which is the first to consider epistemic uncertainty and data uncertainty together.
2. **Provide empirical evidence** that the proposed UQ techniques improve data quality and utility.
3. **Improve predictive performance and significantly reduce model training time**, as observed on a state-of-the-art software vulnerability prediction model.

Through these methods, the paper improves the quality of existing vulnerability patch datasets and offers a direction and technical basis for future related research.
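The summary above does not spell out the selection heuristic, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It uses the standard entropy-based decomposition for an ensemble classifier: total uncertainty is the entropy of the averaged prediction, data (aleatoric) uncertainty is the average entropy of the individual members, and epistemic uncertainty is the difference between the two. The function names, the thresholds, and the rule "keep instances with low data uncertainty and high epistemic uncertainty" are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def decompose_uncertainty(
    member_probs: Sequence[Sequence[float]],
) -> Tuple[float, float, float]:
    """Split predictive uncertainty for a single instance.

    member_probs[m][c] is the probability that ensemble member m
    assigns to class c. Returns (total, aleatoric, epistemic).
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean_probs = [
        sum(p[c] for p in member_probs) / n_members for c in range(n_classes)
    ]
    total = entropy(mean_probs)                       # entropy of the mean
    aleatoric = sum(entropy(p) for p in member_probs) / n_members
    epistemic = max(0.0, total - aleatoric)           # member disagreement
    return total, aleatoric, epistemic


def select_instances(
    dataset_probs: Sequence[Sequence[Sequence[float]]],
    max_aleatoric: float,
    min_epistemic: float,
) -> List[int]:
    """Hypothetical filter: keep indices of instances whose data
    uncertainty is low (likely clean) and whose epistemic uncertainty
    is high (likely informative). Both thresholds are assumptions."""
    kept = []
    for idx, member_probs in enumerate(dataset_probs):
        _, aleatoric, epistemic = decompose_uncertainty(member_probs)
        if aleatoric <= max_aleatoric and epistemic >= min_epistemic:
            kept.append(idx)
    return kept


# Toy predictions from a 3-member ensemble on three patches
# (class 0 = not a vulnerability fix, class 1 = vulnerability fix).
toy_probs = [
    [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]],        # members disagree
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],        # all members unsure
    [[0.99, 0.01], [0.99, 0.01], [0.99, 0.01]],  # all members confident
]
selected = select_instances(toy_probs, max_aleatoric=0.5, min_epistemic=0.1)
```

With these toy values, only the first patch passes the filter. The second is rejected for high data uncertainty (every member is unsure), and the third for low epistemic uncertainty (the ensemble already agrees, so the instance adds little). In practice the thresholds would need to be tuned on held-out data.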

Improving Data Curation of Software Vulnerability Patches through Uncertainty Quantification
