Software Vulnerabilities Detection Based on a Pre-trained Language Model

Wenlin Xu,Tong Li,Jinsong Wang,Haibo Duan,Yahui Tang
DOI: https://doi.org/10.1109/trustcom60117.2023.00129
2024-01-01
Abstract:Software vulnerabilities detection is crucial in cyber security which protects the software systems from malicious attacks. The majority of earlier techniques relied on security professionals to provide software features before training a classification or regression model on the features to find vulnerabilities. However, defining software features and collecting high-quality labeled vulnerabilities for training are both time consuming. To handle these issues, in this paper, we propose an unsupervised and effective method for extracting software features and detecting software vulnerabilities automatically. Firstly, we obtain software features and build a new pre-trained BERT model through constructing C/C++ vocabulary and pre-training on software source code. We then fine-tune the pre-trained BERT model with a deep autoencoder and create low-dimensional embedding from the software features. We finally apply a clustering-based outlier detection method on the embedding to detect vulnerabilities. We evaluate our method on five datasets with programs written in C/C++, experimental results show that our method outperforms state-of-the-art software vulnerability detection methods.
What problem does this paper attempt to address?