VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models

Yu Liu, Lang Gao, Mingxin Yang, Yu Xie, Ping Chen, Xiaojin Zhang, Wei Chen

2024-08-21

Abstract: Large Language Models (LLMs) have training corpora containing large amounts of program code, greatly improving the models' code comprehension and generation capabilities. However, sound, comprehensive research on detecting program vulnerabilities, a more specific code-related task, and on evaluating the performance of LLMs in this more specialized scenario is still lacking. To address common challenges in vulnerability analysis, our study introduces a new benchmark, VulDetectBench, specifically designed to assess the vulnerability detection capabilities of LLMs. The benchmark comprehensively evaluates LLMs' ability to identify, classify, and locate vulnerabilities through five tasks of increasing difficulty. We evaluate the performance of 17 models (both open- and closed-source) and find that while existing models can achieve over 80% accuracy on tasks related to vulnerability identification and classification, they still fall short on specific, more detailed vulnerability analysis tasks, with less than 30% accuracy, making it difficult to provide valuable auxiliary information for professional vulnerability mining. Our benchmark effectively evaluates the capabilities of various LLMs at different levels in the specific task of vulnerability detection, providing a foundation for future research and improvements in this critical area of code security. VulDetectBench is publicly available at https://github.com/Sweetaroo/VulDetectBench.

Cryptography and Security, Artificial Intelligence, Software Engineering

What problem does this paper attempt to address?

This paper addresses the evaluation of large language models (LLMs) on the task of code vulnerability detection. Although existing LLMs perform well at understanding, generating, and summarizing code, their performance on specialized vulnerability detection tasks has not been comprehensively evaluated. The study therefore introduces a new benchmark, VulDetectBench, to systematically assess the capabilities of LLMs in vulnerability detection. VulDetectBench defines five tasks of increasing difficulty:

1. **Vulnerability Presence Detection**: Determine whether the code contains a vulnerability.
2. **Vulnerability Type Inference**: Identify the type of vulnerability in the code (CWE classification).
3. **Critical Data Objects and Functions Identification**: Identify the data objects and function calls that may lead to the vulnerability.
4. **Vulnerability Root Cause Localization**: Precisely locate the root cause of the vulnerability.
5. **Vulnerability Trigger Point Localization**: Determine the specific location where the vulnerability is triggered.

Through these tasks, the researchers aim to build a comprehensive picture of how different LLMs perform across the sub-tasks of vulnerability detection, providing a foundation for future improvements. The benchmark covers both open-source and closed-source models, 17 in total. The study found that existing models perform well on the simpler presence detection and type inference tasks (over 80% accuracy) but leave significant room for improvement on the more detailed vulnerability analysis tasks (less than 30% accuracy).
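To make the five-task structure concrete, the following is a minimal Python sketch of how per-task scores could be tallied. It is not the benchmark's actual harness: the `Sample` record, the exact-match scoring in `accuracy_by_task`, and the `line_recall` helper for the localization tasks are all assumptions introduced for illustration, and the metrics the authors actually use may differ.

```python
from dataclasses import dataclass
from enum import IntEnum


class Task(IntEnum):
    """The five tasks, in the order the summary lists them."""
    PRESENCE_DETECTION = 1
    TYPE_INFERENCE = 2
    KEY_OBJECTS_AND_FUNCTIONS = 3
    ROOT_CAUSE_LOCALIZATION = 4
    TRIGGER_POINT_LOCALIZATION = 5


@dataclass(frozen=True)
class Sample:
    """Hypothetical record: one benchmark item with a parsed model answer."""
    task: Task
    expected: str   # assumed gold label, e.g. "YES" or "CWE-787"
    predicted: str  # assumed model answer after output parsing


def accuracy_by_task(samples):
    """Return {task: fraction of case-insensitive exact matches}."""
    totals, hits = {}, {}
    for s in samples:
        totals[s.task] = totals.get(s.task, 0) + 1
        if s.predicted.strip().upper() == s.expected.strip().upper():
            hits[s.task] = hits.get(s.task, 0) + 1
    return {task: hits.get(task, 0) / n for task, n in totals.items()}


def line_recall(expected_lines, predicted_lines):
    """Assumed localization score: share of gold code lines the model named."""
    expected = set(expected_lines)
    if not expected:
        return 0.0
    return len(expected & set(predicted_lines)) / len(expected)
```

Exact matching suits the yes/no and CWE-label tasks; the localization tasks return code lines, so a set-overlap score such as `line_recall` is one plausible way to grade them.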

How Far Have We Gone in Vulnerability Detection Using Large Language Models

VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching

Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study

Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection

Investigating Large Language Models for Code Vulnerability Detection: An Experimental Study

Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities

Multitask-based Evaluation of Open-Source LLM on Software Vulnerability

Towards Effectively Detecting and Explaining Vulnerabilities Using Large Language Models

An Empirical Study of Automated Vulnerability Localization with Large Language Models

Vulnerability Detection with Code Language Models: How Far Are We?

A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection

Large Language Model for Vulnerability Detection: Emerging Results and Future Directions

Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning

Vul-LMGNNs: Fusing Language Models and Online-Distilled Graph Neural Networks for Code Vulnerability Detection

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning

Automated Software Vulnerability Patching using Large Language Models

Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study

DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection