Abstract: In recent years, the number of vulnerabilities discovered and publicly disclosed has risen sharply. However, only a small fraction of vulnerabilities are ever exploited, so their value to attackers varies widely. Organizations with limited resources therefore need to rule out non-exploitable vulnerabilities quickly and prioritize patches accordingly. Recent work uses machine learning to predict exploited vulnerabilities from features extracted from open-source intelligence (OSINT). Given the explosive growth of vulnerability information, however, these methods leave room for improvement when applied across multiple threat intelligence sources, and a more general method is needed. Moreover, previous methods processed vulnerability-related descriptions with traditional text processing techniques, which capture only static statistical characteristics and ignore the context and meaning of the words. To address these challenges, we propose fastEmbed, an exploit prediction model that combines fastText with the LightGBM algorithm. We replicate key portions of state-of-the-art exploit prediction work and use them as baseline models. By learning embeddings of vulnerability-related text on extremely imbalanced data sets, our model outperforms the baseline in both generalization ability and prediction ability without temporal intermixing, with an average overall improvement of 6.283%. In predicting exploits in the wild, our model also outperforms the baseline, reaching an F1 measure of 0.586 on the minority class (a 33.577% improvement over the work using features from darkweb/deepweb). The results demonstrate that the model can better characterize the exploitability of vulnerabilities and effectively predict exploits in the wild.
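
The abstract reports F1 on the minority (exploited) class instead of accuracy. The following minimal sketch illustrates why that choice matters on imbalanced data. The labels are made-up toy values, not data from the paper, and the helper names `f1_for_class` and `relative_improvement` are hypothetical. The sketch also assumes that a reported percentage improvement is relative to the baseline score, which the abstract does not state explicitly.

```python
def f1_for_class(y_true, y_pred, positive=1):
    """F1 score computed only for the `positive` (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline` (assumed relative, not absolute)."""
    return (new - baseline) / baseline * 100.0


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy data (illustrative only): 100 vulnerabilities, 5 of them exploited.
y_true = [1] * 5 + [0] * 95
# A classifier that finds 3 of the 5 exploited items and raises 2 false alarms.
y_pred = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93
# A trivial classifier that always predicts "not exploited".
y_trivial = [0] * 100

print(accuracy(y_true, y_trivial), f1_for_class(y_true, y_trivial))
print(accuracy(y_true, y_pred), f1_for_class(y_true, y_pred))
```

On this toy set the trivial classifier reaches high accuracy while its minority-class F1 is zero, which is why a score on the exploited class is the more informative measure when exploited vulnerabilities are rare.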
