Abstract: There is an increasing trend to mine vulnerabilities from software repositories and use machine learning techniques to automatically detect software vulnerabilities. A fundamental but unresolved research question is: how do different factors in the mining and learning process impact the accuracy of identifying vulnerabilities in software projects of varying characteristics? Substantial research has been dedicated to this area, including source code static analysis, software repository mining, and NLP-based machine learning. However, practitioners lack experience regarding the key factors for building a state-of-the-art baseline model. In addition, there is little experience regarding the transferability of vulnerability signatures from project to project. This study investigates how the combination of different vulnerability features and three representative machine learning models impacts the accuracy of vulnerability detection in 17 real-world projects. We examine two types of vulnerability representations: 1) code features extracted through NLP with varying tokenization strategies and three different embedding techniques (bag-of-words, word2vec, and fastText) and 2) a set of eight architectural metrics that capture the abstract design of the software systems. The three machine learning algorithms are a random forest model, a support vector machine model, and a residual neural network model. The analysis shows that a recommended baseline model, with signatures extracted through bag-of-words embedding and combined with the random forest, consistently increases the detection accuracy by about 4% compared to other combinations in all 17 projects. Furthermore, our experiments reveal the limitations of transferring vulnerability signatures across domains.
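To make the "varying tokenization strategies" and the bag-of-words representation more concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline: the C-like snippet, the regular expressions, and the helper names `tokenize` and `bag_of_words` are all assumptions chosen for illustration. It shows how filtering comments and symbols changes the token stream, and how token lists become count vectors over a shared vocabulary.

```python
import re
from collections import Counter

# Hypothetical C-like snippet, used only for illustration.
SNIPPET = """
int copy(char *dst, char *src) {
    // unchecked copy
    strcpy(dst, src); /* may overflow */
    return 0;
}
"""

# Simplified patterns (assumptions): they ignore edge cases such as
# comment markers that appear inside string literals.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_0-9]")


def tokenize(code, keep_comments=True, keep_symbols=True):
    """Split code into tokens, optionally dropping comments and symbols."""
    if not keep_comments:
        code = COMMENT_RE.sub(" ", code)
    tokens = TOKEN_RE.findall(code)
    if not keep_symbols:
        # Keep only identifiers, keywords, and numbers.
        tokens = [t for t in tokens if t[0].isalnum() or t[0] == "_"]
    return tokens


def bag_of_words(token_lists):
    """Return a sorted vocabulary and one count vector per token list."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for tokens in token_lists:
        vec = [0] * len(vocab)
        for tok, count in Counter(tokens).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


raw = tokenize(SNIPPET)
filtered = tokenize(SNIPPET, keep_comments=False, keep_symbols=False)
vocab, vectors = bag_of_words([raw, filtered])
print(filtered)
```

In a setup like the one the abstract describes, vectors of this kind would be the input to a classifier such as a random forest; the classifier itself is left out here to keep the sketch dependency-free.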

What problem does this paper attempt to address?

The core problem that this paper attempts to solve is: **How do different factors in the machine-learning process affect the accuracy of vulnerability detection in software projects?** Specifically, the authors focus on:

1. **The influence of different feature representation methods on vulnerability detection**: including features extracted from source code (such as code features obtained through natural language processing techniques) and metrics reflecting the complexity of the software architecture.
2. **The influence of different machine-learning models on vulnerability detection**: the performance of three representative models, namely random forest, support vector machine, and residual neural network, is studied.
3. **The effectiveness of cross-project transfer of vulnerability features**: the paper explores whether the vulnerability features learned in one project can be effectively transferred to other projects.

### Main research questions of the paper

To answer the core question above, the authors propose the following specific research questions (RQs):

- **RQ1**: Does filtering out symbols and comments in the code affect the results of vulnerability detection?
  - The authors answer this question by comparing the detection performance with and without comments and symbols.
- **RQ2**: Which embedding technique performs better across multiple software projects?
  - The effects of three embedding techniques, namely Bag-of-Words (BOW), Word2Vec, and FastText, are studied.
- **RQ3**: Can the architecture metrics used to measure the complexity of the software structure improve the accuracy of vulnerability detection?
  - The authors train on NLP-based code embeddings and on architecture metrics separately and compare their effects; they also combine the two to see whether performance improves.
- **RQ4**: Which machine-learning model performs better across different projects?
  - The performance of three models, namely random forest, support vector machine, and residual neural network, is compared.
- **RQ5**: How well do the learned features transfer when predicting vulnerabilities across projects?
  - The transfer effect of the learned features among different projects is evaluated by cross-validation.

### Experimental design and results

The authors conduct experiments on 17 real-world projects, running a total of 408 experiments and evaluating ten hypotheses. The experimental results show that:

- **95% of the evaluation indicators (such as precision, recall, and F1-score) are higher than 0.77**.
- Using **Bag-of-Words embedding** combined with the **random forest model** increases the detection accuracy by about 4% on average in all 17 projects.
- Although the architecture metrics help improve the detection accuracy, their contribution is not as great as that of the code embedding features.

### Conclusion

This research provides a valuable reference for constructing baseline models, helps researchers understand which factors are most critical for vulnerability detection, and lays the foundation for future vulnerability detection work. The research also reveals the limitations of the current cross-project transfer of vulnerability features and points out directions for further research.
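The evaluation indicators named in the results (precision, recall, and F1-score) follow the standard binary-classification definitions. The sketch below computes them in plain Python; the function name `precision_recall_f1` and the label vectors are hypothetical and do not come from the paper's data.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = vulnerable)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels for eight files: 1 = vulnerable, 0 = not vulnerable.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With these made-up labels there are three true positives, one false positive, and one false negative, so all three indicators come out to 0.75.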

Explaining the Contributing Factors for Vulnerability Detection in Machine Learning

Function-Level Vulnerability Detection Through Fusing Multi-Modal Knowledge

Categorizing and Predicting Invalid Vulnerabilities on Common Vulnerabilities and Exposures

An empirical study of text-based machine learning models for vulnerability detection

The impact factors on the performance of machine learning-based vulnerability detection: A comparative study

Combining Software Metrics and Text Features for Vulnerable File Prediction

Automated software vulnerability detection with machine learning

A Comparative Study of Deep Learning-Based Vulnerability Detection System

A Mining Approach to Obtain the Software Vulnerability Characteristics

A Survey on Automated Software Vulnerability Detection Using Machine Learning and Deep Learning

Learning-based Models for Vulnerability Detection: An Extensive Study

An extensive study of the effects of different deep learning models on code vulnerability detection in Python code

Software Vulnerability Mining and Analysis Based on Deep Learning

Predicting Exploitation of Disclosed Software Vulnerabilities Using Open-source Data

Predicting Vulnerable Components via Text Mining or Software Metrics? An Effort-Aware Perspective

Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection

Towards Effectively Detecting and Explaining Vulnerabilities Using Large Language Models

Representation vs. Model: What Matters Most for Source Code Vulnerability Detection

A performance evaluation of deep‐learnt features for software vulnerability detection

Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection