Abstract: The lack of comprehensive sources of accurate vulnerability data is a critical obstacle to studying and understanding software vulnerabilities (and their corrections). In this paper, we present an approach that combines heuristics stemming from practical experience with machine learning (ML), specifically natural language processing (NLP), to address this problem. Our method consists of three phases. First, we construct an advisory record object containing key information about a vulnerability, extracted from an advisory such as those found in the National Vulnerability Database (NVD). These advisories are expressed in natural language. Second, using heuristics, we obtain a subset of candidate fix commits from the source code repository of the affected project by filtering out commits that can be identified as unrelated to the vulnerability at hand. Finally, for each of the remaining candidate commits, our method builds a numerical feature vector reflecting the characteristics of the commit that are relevant to predicting its match with the advisory at hand. Based on the values of these feature vectors, our method produces a ranked list of candidate fix commits. The score attributed by the ML model to each feature is kept visible to users, allowing them to easily interpret the predictions. We implemented our approach and evaluated it on an open data set, built by manual curation, that comprises 2,391 known fix commits corresponding to 1,248 public vulnerability advisories. When considering the top-10 commits in the ranked results, our implementation could successfully identify at least one fix commit for up to 84.03% of the vulnerabilities (with a fix commit in the first position for 65.06% of the vulnerabilities). Our evaluation shows that our method can considerably reduce the manual effort needed to search OSS repositories for the commits that fix known vulnerabilities.
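The three phases described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the date-window filter, the three features, and the fixed weights (standing in for a trained ML model) are all assumptions chosen for illustration, and the CVE identifier is made up. The sketch builds an advisory record, filters candidate commits with a simple heuristic, scores each remaining commit from a small feature vector while keeping per-feature contributions visible, and checks whether a known fix appears in the top-k results.

```python
import re
from dataclasses import dataclass


@dataclass
class Advisory:          # hypothetical stand-in for the "advisory record"
    vuln_id: str
    description: str
    published_day: int   # days since an arbitrary epoch (assumption)


@dataclass
class Commit:
    sha: str
    message: str
    day: int
    files: tuple


def words(text):
    """Lowercased word-like tokens of at least four characters."""
    return set(re.findall(r"[a-z][a-z0-9_]{3,}", text.lower()))


def filter_candidates(advisory, commits, window_days=60):
    """Phase 2 (assumed heuristic): drop commits far from the publication date."""
    return [c for c in commits
            if abs(c.day - advisory.published_day) <= window_days]


def features(advisory, commit):
    """Phase 3: a small numerical feature vector (illustrative features only)."""
    adv_words = words(advisory.description)
    path_words = words(" ".join(commit.files))
    return {
        "mentions_vuln_id": float(advisory.vuln_id.lower() in commit.message.lower()),
        "message_word_overlap": float(len(adv_words & words(commit.message))),
        "path_word_overlap": float(len(adv_words & path_words)),
    }


# Fixed weights stand in for a trained model; the values are arbitrary.
WEIGHTS = {"mentions_vuln_id": 5.0,
           "message_word_overlap": 1.0,
           "path_word_overlap": 2.0}


def rank(advisory, commits):
    """Return (score, sha, per-feature contributions), best candidate first."""
    scored = []
    for c in filter_candidates(advisory, commits):
        contrib = {name: WEIGHTS[name] * value
                   for name, value in features(advisory, c).items()}
        scored.append((sum(contrib.values()), c.sha, contrib))
    return sorted(scored, key=lambda item: item[0], reverse=True)


def hit_at_k(ranking, known_fix_shas, k=10):
    """True if at least one known fix commit is among the top-k candidates."""
    return any(sha in known_fix_shas for _, sha, _ in ranking[:k])


advisory = Advisory(
    "CVE-0000-0001",  # made-up identifier
    "Path traversal in the archive extractor allows writing files "
    "outside the target directory",
    published_day=100,
)
commits = [
    Commit("a1", "Update README badges", 98, ("README.md",)),
    Commit("b2", "Fix path traversal in archive extractor (CVE-0000-0001)",
           95, ("src/archive/extractor.py",)),
    Commit("c3", "Refactor archive tests", 300, ("tests/test_archive.py",)),
]
ranking = rank(advisory, commits)
for score, sha, contrib in ranking:
    print(sha, score, contrib)
```

Returning the per-feature contributions alongside the total score mirrors the interpretability goal stated in the abstract: a user can see why a commit was ranked highly, not just that it was.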

A Machine Learning Approach for Vulnerability Curation

Function-Level Vulnerability Detection Through Fusing Multi-Modal Knowledge

Categorizing and Predicting Invalid Vulnerabilities on Common Vulnerabilities and Exposures

Automated software vulnerability detection with machine learning

Explaining the Contributing Factors for Vulnerability Detection in Machine Learning

Combining Software Metrics and Text Features for Vulnerable File Prediction

An empirical study of text-based machine learning models for vulnerability detection

VulCurator: A Vulnerability-Fixing Commit Detector

Improving Data Curation of Software Vulnerability Patches through Uncertainty Quantification

A Survey on Automated Software Vulnerability Detection Using Machine Learning and Deep Learning

Learning-based Models for Vulnerability Detection: An Extensive Study

DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

A Comparative Study of Deep Learning-Based Vulnerability Detection System

Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection

S2Vul: Vulnerability Analysis Based on Self-supervised Information Integration

Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories

M2CVD: Enhancing Vulnerability Semantic through Multi-Model Collaboration for Code Vulnerability Detection

Software security with natural language processing and vulnerability scoring using machine learning approach

Automated Software Vulnerability Assessment with Concept Drift

Investigating Large Language Models for Code Vulnerability Detection: An Experimental Study