StagedVulBERT: Multi-Granular Vulnerability Detection with a Novel Pre-trained Code Model

Yuan Jiang, Yujian Zhang, Xiaohong Su, Christoph Treude, Tiantian Wang
2024-10-08
Abstract: The emergence of pre-trained model-based vulnerability detection methods has significantly advanced the field of automated vulnerability detection. However, these methods still face several challenges, such as difficulty in learning effective feature representations of statements for fine-grained predictions and struggling to process overly long code sequences. To address these issues, this study introduces StagedVulBERT, a novel vulnerability detection framework that leverages a pre-trained code language model and employs a coarse-to-fine strategy. The key innovation and contribution of our research lies in the development of the CodeBERT-HLS component within our framework, specialized in hierarchical, layered, and semantic encoding. This component is designed to capture semantics at both the token and statement levels simultaneously, which is crucial for achieving more accurate multi-granular vulnerability detection. Additionally, CodeBERT-HLS efficiently processes longer code token sequences, making it more suited to real-world vulnerability detection. Comprehensive experiments demonstrate that our method enhances the performance of vulnerability detection at both coarse- and fine-grained levels. Specifically, in coarse-grained vulnerability detection, StagedVulBERT achieves an F1 score of 92.26%, marking a 6.58% improvement over the best-performing methods. At the fine-grained level, our method achieves a Top-5% accuracy of 65.69%, which outperforms the state-of-the-art methods by up to 75.17%.
Cryptography and Security, Software Engineering
What problem does this paper attempt to address?
The paper addresses three challenges that existing pre-trained model-based vulnerability detection methods face in fine-grained prediction and in handling overly long code sequences:

1. **Difficulty in Fine-Grained Prediction**: Existing methods primarily treat code as a sequence of tokens and use basic Transformer architectures to capture relationships between tokens. Token-level features alone make it difficult to achieve strong statement-level vulnerability detection.
2. **Inconsistency Between Training and Prediction Objectives**: When a method designed for multi-granularity detection is trained only on function-level (coarse-grained) labels but is expected to perform fine-grained detection on unseen programs, the mismatch between the training and prediction objectives can prevent it from accurately locating vulnerable lines.
3. **Limited Ability to Handle Long Code Sequences**: Existing pre-trained code language models (such as CodeBERT or UniXcoder) are typically limited to a maximum sequence length of 512 tokens. Functions that exceed this length are truncated, which may discard useful information.

To address these issues, the study proposes StagedVulBERT, a staged vulnerability detection framework built around the CodeBERT-HLS component. CodeBERT-HLS is designed for hierarchical, layered, and semantic encoding and aims to capture both token-level and statement-level semantics, which is crucial for more accurate multi-granularity vulnerability detection. CodeBERT-HLS can also handle longer code token sequences efficiently, making it better suited to practical vulnerability detection tasks. With this framework, the researchers aim to improve performance in both coarse-grained and fine-grained vulnerability detection.
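The truncation problem in the third point can be illustrated with a minimal Python sketch. It is not the authors' CodeBERT-HLS implementation. It only contrasts flat truncation at 512 tokens with keeping one token sequence per statement, which is the general idea behind a hierarchical encoder. The whitespace tokenizer, the helper names, and the synthetic 200-statement function are all assumptions made for illustration.

```python
MAX_TOKENS = 512  # sequence limit cited in the text for CodeBERT / UniXcoder


def tokenize(statement: str) -> list[str]:
    # Hypothetical stand-in for a real subword tokenizer: split on whitespace.
    return statement.split()


def truncate_flat(statements: list[str], limit: int = MAX_TOKENS) -> list[str]:
    """Flatten all statements into one token sequence and cut it at the limit."""
    flat = [tok for stmt in statements for tok in tokenize(stmt)]
    return flat[:limit]


def split_by_statement(statements: list[str], limit: int = MAX_TOKENS) -> list[list[str]]:
    """Keep one token sequence per statement; the limit applies per statement."""
    return [tokenize(stmt)[:limit] for stmt in statements]


def statements_fully_kept(statements: list[str], limit: int = MAX_TOKENS) -> int:
    """Count the statements whose tokens all survive flat truncation."""
    kept, used = 0, 0
    for stmt in statements:
        used += len(tokenize(stmt))
        if used > limit:
            break
        kept += 1
    return kept


# Synthetic function body (an assumption): 200 statements of 4 tokens each.
function_body = [f"v{i} = call{i}() ;" for i in range(200)]

flat_tokens = truncate_flat(function_body)
per_statement = split_by_statement(function_body)

print(len(flat_tokens))                      # tokens seen by a flat model
print(statements_fully_kept(function_body))  # statements that survive truncation
print(len(per_statement))                    # statements seen by a per-statement split
```

Under these assumptions, the flat model never sees any statement past the cut-off, so it cannot flag a vulnerable line located there. The per-statement split keeps every statement available for a later statement-level encoder.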