Comparing Unidirectional, Bidirectional, and Word2vec Models for Discovering Vulnerabilities in Compiled Lifted Code

Gary A. McCully, John D. Hastings, Shengjie Xu, Adam Fortier

2024-09-26

Abstract: Ransomware and other forms of malware cause significant financial and operational damage to organizations by exploiting long-standing and often difficult-to-detect software vulnerabilities. To detect vulnerabilities such as buffer overflows in compiled code, this research investigates the application of unidirectional transformer-based embeddings, specifically GPT-2. Using a dataset of LLVM functions, we trained a GPT-2 model to generate embeddings, which were subsequently used to build LSTM neural networks to differentiate between vulnerable and non-vulnerable code. Our study reveals that embeddings from the GPT-2 model significantly outperform those from the bidirectional models BERT and RoBERTa, achieving an accuracy of 92.5% and an F1-score of 89.7%. LSTM neural networks were developed with both frozen and unfrozen embedding model layers. The highest-performing model was obtained when the embedding layers were unfrozen. Further, in exploring the impact of different optimizers within this domain, the research finds that the SGD optimizer outperforms Adam. Overall, these findings reveal important insights into the potential of unidirectional transformer-based approaches in enhancing cybersecurity defenses.

Cryptography and Security, Computation and Language, Machine Learning, Software Engineering

What problem does this paper attempt to address?

This paper addresses how to detect vulnerabilities in compiled code, especially software vulnerabilities that are difficult to detect, such as stack-based buffer overflows (CWE-121). To meet this challenge, the researchers use embeddings generated by unidirectional transformers (such as GPT-2), bidirectional transformers (such as BERT and RoBERTa), and non-transformer embedding models (such as Word2Vec) to train neural networks that distinguish between vulnerable and non-vulnerable code. Specifically, the main objectives of the paper are:

1. **Comparing different types of embedding models**: evaluating the performance of unidirectional transformers (GPT-2), bidirectional transformers (BERT, RoBERTa), and non-transformer embedding models (Skip-Gram, CBOW) in identifying vulnerabilities in compiled code.
2. **Optimizing neural network performance**: varying the optimizer (SGD or Adam) and its parameters to find which configuration yields the best LSTM neural network performance.
3. **Investigating the impact of freezing and unfreezing the embedding layer**: studying how freezing or unfreezing the embedding layer during training affects final model performance.

The results show that GPT-2 embeddings significantly outperform those of the other models in identifying vulnerabilities in compiled code. In particular, using the SGD optimizer with the embedding layer unfrozen achieves an accuracy of 92.5% and an F1-score of 89.7%. These findings provide important insights for improving cybersecurity defenses, especially when dealing with compiled binaries, by enabling more effective detection of potential security threats.
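The second and third objectives can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the class name `LSTMVulnClassifier`, the helper `make_optimizer`, the layer sizes, the learning rate, and the random stand-in for pretrained embedding weights are all assumptions made for illustration. The sketch only shows how an embedding layer can be frozen or left trainable in front of an LSTM classifier, and how the optimizer can be swapped between SGD and Adam.

```python
import torch
import torch.nn as nn


class LSTMVulnClassifier(nn.Module):
    """Hypothetical binary classifier: embedding -> LSTM -> linear head."""

    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=32,
                 freeze_embeddings=False):
        super().__init__()
        # Assumption: random weights stand in for embeddings taken from a
        # pretrained model (GPT-2, BERT, RoBERTa, or word2vec in the paper).
        pretrained = torch.randn(vocab_size, embed_dim)
        # freeze=True keeps the embedding weights fixed during training;
        # freeze=False lets them be fine-tuned together with the LSTM.
        self.embedding = nn.Embedding.from_pretrained(
            pretrained, freeze=freeze_embeddings)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(embedded)         # h_n: (layers, batch, hidden)
        return self.head(h_n[-1]).squeeze(-1)     # one logit per function


def make_optimizer(model, name="sgd", lr=0.01):
    """Build SGD or Adam over the trainable parameters only."""
    params = [p for p in model.parameters() if p.requires_grad]
    if name == "sgd":
        return torch.optim.SGD(params, lr=lr)
    if name == "adam":
        return torch.optim.Adam(params, lr=lr)
    raise ValueError(f"unknown optimizer: {name}")


def train_step(model, optimizer, token_ids, labels):
    """Run one gradient step on a batch and return the loss value."""
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer.zero_grad()
    loss = loss_fn(model(token_ids), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, `LSTMVulnClassifier(freeze_embeddings=False)` paired with `make_optimizer(model, "sgd")` corresponds to the configuration the summary reports as best (unfrozen embeddings with SGD). The placeholder hyperparameters are not the values used in the paper.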

Bi-Directional Transformers vs. word2vec: Discovering Vulnerabilities in Lifted Compiled Code

Impact of Data Snooping on Deep Learning Models for Locating Vulnerabilities in Lifted Code

An extensive study of the effects of different deep learning models on code vulnerability detection in Python code

Detecting software vulnerabilities using Language Models

Vul-LMGNNs: Fusing Language Models and Online-Distilled Graph Neural Networks for Code Vulnerability Detection

VDDL: A Deep Learning-Based Vulnerability Detection Model for Smart Contracts

Deep-Learning-based Vulnerability Detection in Binary Executables

Transformer-Based Language Models for Software Vulnerability Detection

Automated Software Vulnerability Static Code Analysis Using Generative Pre-Trained Transformer Models

Codesentry: Revolutionizing Real-Time Software Vulnerability Detection With Optimized GPT Framework

Foundational Models for Malware Embeddings Using Spatio-Temporal Parallel Convolutional Networks

Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries

V2W-BERT: A Framework for Effective Hierarchical Multiclass Classification of Software Vulnerabilities

An Unbiased Transformer Source Code Learning with Semantic Vulnerability Graph

Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities

Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models

Large Language Model for Vulnerability Detection: Emerging Results and Future Directions

Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

VulGraB: Graph‐embedding‐based code vulnerability detection with bi‐directional gated graph neural network