Vulnerability Detection with Code Language Models: How Far Are We?

Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, Yizheng Chen
2024-07-10
Abstract: In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs' performance in real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.
Software Engineering, Computation and Language
What problem does this paper attempt to address?
This paper addresses the effectiveness and reliability of code language models (code LMs) for vulnerability detection. It identifies several key problems in existing vulnerability datasets and evaluation methods, and proposes a new dataset and evaluation guidelines that assess these models under more realistic conditions. The main problems and the proposed solutions are as follows:

### 1. Problems with existing vulnerability datasets

#### Data quality problems

- **Label noise**: Many existing datasets rely on automatic labeling, which produces inaccurate labels. For example, datasets such as BigVul label every function changed by a vulnerability-fixing commit as vulnerable, but in reality a single commit may fix multiple vulnerabilities or contain unrelated changes.
- **Data duplication**: Existing datasets contain a large number of duplicate samples, which leak between the training and test sets and make performance evaluation inaccurate.

#### Low label accuracy

- **Automated vs. manual annotation**: Automated annotation is low-cost but imprecise, while manual annotation is precise but costly and limited in scale. For example, SVEN is currently the most accurate manually annotated dataset, but it contains only 1.6k samples and 9 CWE types.

#### Data leakage

- **Code replication**: Large numbers of identical code fragments are shared between training and test data, which distorts evaluation results.
- **Time travel**: Because existing datasets split training, validation, and test sets at random, a model may be trained on future data and tested on past data, which further undermines the reliability of the evaluation.

### 2. Problems with existing evaluation metrics

- **Accuracy**: Unsuitable for vulnerability detection because most code is not vulnerable, so a model that always predicts "non-vulnerable" still achieves high accuracy.
- **F1-score**: Although widely used for imbalanced datasets, the F1-score does not reflect how well a tool keeps false positives low, which is a central challenge in practical use.

### 3. Proposed solutions

#### New dataset PrimeVul

PrimeVul improves on existing datasets in the following ways:

- **High-quality data collection**: Applies a strict de-duplication strategy to prevent data leakage between the training and test sets.
- **High-precision labels**: Adopts two novel annotation techniques:
  - **PrimeVul-OneFunc**: Labels a function as vulnerable only when the commit modifies exactly one function.
  - **PrimeVul-NVDCheck**: Uses the CVE descriptions in the NVD database as an expert-level reference for labeling, to ensure label accuracy.
- **Large-scale expansion**: Contains 6,968 vulnerable functions and 228,800 benign functions, covering 140 CWE types.

#### New evaluation guidelines

- **Chronological division**: Splits the dataset in chronological order to reduce the risk of data leakage.
- **Vulnerability Detection Score (VD-S)**: A new evaluation metric that measures the false negative rate at a fixed false positive rate.
- **Pair-wise evaluation**: Compares each vulnerable function with its fixed version to evaluate the model's ability to distinguish between near-identical code.

### 4. Experimental results

The paper evaluates multiple code language models and finds that their performance on PrimeVul is far below what existing benchmarks suggest. Even the most advanced models, such as GPT-3.5 and GPT-4, fail to identify vulnerabilities effectively. This indicates that current models' vulnerability detection capabilities are still insufficient for real-world use and that more innovative research is required.
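The VD-S metric described under the evaluation guidelines can be illustrated with a short sketch. This is a minimal Python illustration, not the paper's implementation: the function name `vd_score`, the parameter `max_fpr`, and its default tolerance are assumptions introduced here. It sweeps decision thresholds over the model's scores and reports the lowest false negative rate among thresholds whose false positive rate stays within the tolerance.

```python
from typing import Sequence


def vd_score(
    labels: Sequence[int],
    scores: Sequence[float],
    max_fpr: float = 0.005,  # assumed tolerance, chosen for illustration
) -> float:
    """Lowest false negative rate achievable while keeping FPR <= max_fpr.

    labels: 1 for vulnerable, 0 for benign.
    scores: model confidence that the sample is vulnerable.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one vulnerable and one benign sample")

    # A threshold above every score flags nothing: FPR = 0 and FNR = 1.
    best_fnr = 1.0

    # Try each distinct score as a threshold; predict vulnerable if score >= t.
    for t in sorted(set(scores)):
        false_pos = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        false_neg = sum(1 for y, s in zip(labels, scores) if y == 1 and s < t)
        if false_pos / negatives <= max_fpr:
            best_fnr = min(best_fnr, false_neg / positives)
    return best_fnr


# Toy data: two vulnerable and four benign functions.
toy_labels = [1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1]
print(vd_score(toy_labels, toy_scores, max_fpr=0.0))   # 0.5
print(vd_score(toy_labels, toy_scores, max_fpr=0.25))  # 0.0
```

In the toy data, one benign function scores higher than one of the vulnerable functions, so a detector that tolerates no false alarms must miss that vulnerability. A lower VD-S is better, and the score depends directly on how strict the false positive budget is.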
### Summary

Through an in-depth analysis of the defects in existing datasets and evaluation methods, the paper proposes a new dataset, PrimeVul, and an evaluation framework that together aim to provide a more reliable and more practical vulnerability detection benchmark. The experimental results show that the performance of existing models in real-world scenarios is far from meeting deployment requirements, which underlines the importance of further research.
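The de-duplication and chronological split that the benchmark relies on can be sketched as follows. This is a hedged illustration rather than the paper's pipeline: the `Sample` record, the whitespace-stripping normalization, the SHA-256 hash, and the 80/10/10 fractions are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one function in the dataset."""
    code: str
    commit_date: date
    vulnerable: bool


def normalized_hash(code: str) -> str:
    # Drop all whitespace so formatting-only variants hash identically.
    # (Crude on purpose: a real pipeline might tokenize instead.)
    stripped = "".join(code.split())
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()


def deduplicate(samples: Sequence[Sample]) -> List[Sample]:
    """Keep the first occurrence of each normalized function body."""
    seen = set()
    unique = []
    for sample in samples:
        digest = normalized_hash(sample.code)
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def chronological_split(
    samples: Sequence[Sample],
    train_frac: float = 0.8,  # assumed fractions, for illustration only
    valid_frac: float = 0.1,
) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Oldest samples go to training, newest to test: no 'time travel'."""
    ordered = sorted(samples, key=lambda s: s.commit_date)
    train_end = int(len(ordered) * train_frac)
    valid_end = train_end + int(len(ordered) * valid_frac)
    return ordered[:train_end], ordered[train_end:valid_end], ordered[valid_end:]


# Ten distinct functions plus one reformatted copy of the first.
raw = [Sample(f"int f{i}() {{ return {i}; }}", date(2020, 1, i + 1), i % 2 == 0)
       for i in range(10)]
raw.append(Sample("int f0()  {\n  return 0;\n}", date(2021, 6, 1), True))

clean = deduplicate(raw)
train, valid, test = chronological_split(clean)
print(len(clean), len(train), len(valid), len(test))  # 10 8 1 1
```

De-duplicating before splitting matters: if the reformatted copy were kept, it would carry a later commit date and land in the test set while its twin sits in the training set, which is exactly the leakage the benchmark is designed to avoid.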