StagedVulBERT: Multi-Granular Vulnerability Detection with a Novel Pre-trained Code Model

Yuan Jiang, Yujian Zhang, Xiaohong Su, Christoph Treude, Tiantian Wang

2024-10-08

Abstract: The emergence of pre-trained model-based vulnerability detection methods has significantly advanced the field of automated vulnerability detection. However, these methods still face several challenges, such as difficulty in learning effective feature representations of statements for fine-grained predictions and struggling to process overly long code sequences. To address these issues, this study introduces StagedVulBERT, a novel vulnerability detection framework that leverages a pre-trained code language model and employs a coarse-to-fine strategy. The key innovation and contribution of our research lies in the development of the CodeBERT-HLS component within our framework, specialized in hierarchical, layered, and semantic encoding. This component is designed to capture semantics at both the token and statement levels simultaneously, which is crucial for achieving more accurate multi-granular vulnerability detection. Additionally, CodeBERT-HLS efficiently processes longer code token sequences, making it more suited to real-world vulnerability detection. Comprehensive experiments demonstrate that our method enhances the performance of vulnerability detection at both coarse- and fine-grained levels. Specifically, in coarse-grained vulnerability detection, StagedVulBERT achieves an F1 score of 92.26%, marking a 6.58% improvement over the best-performing methods. At the fine-grained level, our method achieves a Top-5% accuracy of 65.69%, which outperforms the state-of-the-art methods by up to 75.17%.

Cryptography and Security, Software Engineering

What problem does this paper attempt to address?

The paper attempts to address the challenges faced by existing pre-trained model-based vulnerability detection methods in fine-grained prediction and handling overly long code sequences. Specifically:

1. **Difficulty in Fine-Grained Prediction**: Existing methods primarily view code as a series of tokens, using basic Transformer architectures to capture relationships between tokens. However, relying solely on token-level features makes it difficult to achieve excellent performance in vulnerability detection at the statement level.
2. **Inconsistency Between Training and Prediction Objectives**: If methods designed for multi-granularity vulnerability detection use only function-level (coarse-grained) labels to train the model, aiming to achieve fine-grained detection in unseen programs, the difference in objectives between training and prediction may lead to an inability to accurately locate vulnerable lines.
3. **Limited Ability to Handle Long Code Sequences**: Existing pre-trained code language models (such as CodeBERT or UniXcoder) are typically limited by sequence length, being able to handle a maximum of 512 tokens. Therefore, code functions exceeding the maximum length will be truncated, which may result in the loss of useful information.

To address these issues, the study proposes StagedVulBERT, a staged vulnerability detection framework that introduces the CodeBERT-HLS component. CodeBERT-HLS is specifically designed for hierarchical, layered, and semantic encoding, aiming to capture both token and statement-level semantics, which is crucial for achieving more accurate multi-granularity vulnerability detection. Additionally, CodeBERT-HLS can efficiently handle longer code token sequences, making it more suitable for practical vulnerability detection tasks. Through this framework, the researchers hope to achieve better performance in both coarse-grained and fine-grained vulnerability detection.
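The effect of the 512-token limit can be illustrated with a small sketch. This is not the CodeBERT-HLS implementation: the whitespace tokenizer, the synthetic function, and the helper names (`tokenize`, `truncate`, `split_by_statement`) are assumptions made for illustration only. It contrasts flat truncation, which discards everything past the limit, with a per-statement split of the kind a hierarchical encoder could consume.

```python
MAX_TOKENS = 512  # sequence limit the text cites for CodeBERT / UniXcoder


def tokenize(code: str) -> list[str]:
    """Whitespace tokenizer, a stand-in for a real subword tokenizer."""
    return code.split()


def truncate(tokens: list[str], limit: int = MAX_TOKENS) -> list[str]:
    """Flat input: every token past the limit is dropped."""
    return tokens[:limit]


def split_by_statement(code: str, limit: int = MAX_TOKENS) -> list[list[str]]:
    """Hierarchical input: one token list per non-empty line, each capped at `limit`."""
    statements = []
    for line in code.splitlines():
        tokens = tokenize(line)
        if tokens:
            statements.append(tokens[:limit])
    return statements


# Synthetic function (hypothetical data): 200 lines of 4 tokens each, 800 tokens in total.
function_source = "\n".join(f"x{i} = f{i} ;" for i in range(200))

flat = truncate(tokenize(function_source))
statements = split_by_statement(function_source)

lost = len(tokenize(function_source)) - len(flat)
kept = sum(len(s) for s in statements)

print(f"flat truncation drops {lost} tokens")
print(f"statement split keeps {kept} tokens across {len(statements)} statements")
```

In this toy setting the flat input loses the tail of the function, while the per-statement split keeps every token because no single line exceeds the limit. How statement representations are then combined is specific to the paper's architecture and is not modelled here.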

Function-Level Vulnerability Detection Through Fusing Multi-Modal Knowledge

VulD-SG: Enhancing Code Vulnerability Detection Via Combining Deep Sequence and Graph Model

Vul-LMGNNs: Fusing Language Models and Online-Distilled Graph Neural Networks for Code Vulnerability Detection

Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning

mVulPreter: A Multi-Granularity Vulnerability Detection System With Interpretations

Multi-view Pre-trained Model for Code Vulnerability Identification

Enhancing Pre-Trained Language Models for Vulnerability Detection via Semantic-Preserving Data Augmentation

Detecting software vulnerabilities using Language Models

XGV-BERT: Leveraging Contextualized Language Model and Graph Neural Network for Efficient Software Vulnerability Detection

Making vulnerability prediction more practical: Prediction, categorization, and localization

VulANalyzeR: Explainable Binary Vulnerability Detection with Multi-task Learning and Attentional Graph Convolution

Automated Vulnerability Detection Using Deep Learning Technique

PATVD: Vulnerability Detection Based on Pre-training Techniques and Adversarial Training

VulGraB: Graph‐embedding‐based code vulnerability detection with bi‐directional gated graph neural network

Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models

Enhancing Deep Learning-based Vulnerability Detection by Building Behavior Graph Model

VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models

Software Vulnerabilities Detection Based on a Pre-trained Language Model

DeepVulSeeker: A novel vulnerability identification framework via code graph structure and pre-training mechanism