VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models

Yu Liu, Lang Gao, Mingxin Yang, Yu Xie, Ping Chen, Xiaojin Zhang, Wei Chen
2024-08-21
Abstract: Large Language Models (LLMs) are trained on corpora containing large amounts of program code, which greatly improves their code comprehension and generation capabilities. However, sound and comprehensive research on detecting program vulnerabilities, a more specialized code-related task, and on evaluating the performance of LLMs in this scenario is still lacking. To address common challenges in vulnerability analysis, our study introduces a new benchmark, VulDetectBench, specifically designed to assess the vulnerability detection capabilities of LLMs. The benchmark comprehensively evaluates LLMs' ability to identify, classify, and locate vulnerabilities through five tasks of increasing difficulty. We evaluate the performance of 17 models (both open- and closed-source) and find that while existing models can achieve over 80% accuracy on tasks related to vulnerability identification and classification, they still fall short on specific, more detailed vulnerability analysis tasks, with less than 30% accuracy, making it difficult to provide valuable auxiliary information for professional vulnerability mining. Our benchmark effectively evaluates the capabilities of various LLMs at different levels in the specific task of vulnerability detection, providing a foundation for future research and improvements in this critical area of code security. VulDetectBench is publicly available at https://github.com/Sweetaroo/VulDetectBench.
Cryptography and Security, Artificial Intelligence, Software Engineering
What problem does this paper attempt to address?
This paper addresses how to evaluate large language models (LLMs) on the task of code vulnerability detection. Although existing LLMs perform well at understanding, generating, and summarizing code, their performance on specialized vulnerability detection tasks has not been comprehensively evaluated. The study therefore introduces a new benchmark, VulDetectBench, aimed at systematically assessing the vulnerability detection capabilities of LLMs. VulDetectBench defines five tasks of increasing difficulty:

1. **Vulnerability Presence Detection**: Determine whether the code contains a vulnerability.
2. **Vulnerability Type Inference**: Identify the type of vulnerability in the code (CWE classification).
3. **Critical Data Objects and Functions Identification**: Identify the data objects and function calls that may lead to the vulnerability.
4. **Vulnerability Root Cause Localization**: Precisely locate the root cause of the vulnerability.
5. **Vulnerability Trigger Point Localization**: Determine the specific location where the vulnerability is triggered.

Through these tasks, the authors aim to build a comprehensive picture of how different LLMs perform on the sub-tasks of vulnerability detection, providing a foundation for future improvements. The benchmark covers both open-source and closed-source models, 17 in total. The study finds that existing models exceed 80% accuracy on the simpler presence detection and type inference tasks, but score below 30% on the more detailed vulnerability analysis tasks, leaving significant room for improvement.
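To make the five-task structure concrete, the following is a minimal Python sketch of how such a benchmark could be represented and scored. It is not taken from the VulDetectBench repository: the `Task` class, the `accuracy` function, and the `line_recall` function are hypothetical names, and the scoring rules (plain accuracy for yes/no detection, line-overlap recall for localization) are illustrative assumptions rather than the paper's official metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One benchmark task (hypothetical container, not the repository's schema)."""
    number: int
    name: str


# The five tasks, in the order of increasing difficulty given in the summary.
TASKS = [
    Task(1, "Vulnerability Presence Detection"),
    Task(2, "Vulnerability Type Inference"),
    Task(3, "Critical Data Objects and Functions Identification"),
    Task(4, "Vulnerability Root Cause Localization"),
    Task(5, "Vulnerability Trigger Point Localization"),
]


def accuracy(predictions, labels):
    """Fraction of predictions equal to their labels (assumed metric for Task 1)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def line_recall(predicted_lines, true_lines):
    """Share of ground-truth lines the model also reported.

    An assumed, simplified stand-in for a localization score (Tasks 4 and 5).
    """
    true_set = set(true_lines)
    if not true_set:
        return 0.0
    return len(true_set & set(predicted_lines)) / len(true_set)


if __name__ == "__main__":
    # Toy data invented for illustration only; not results from the paper.
    print(accuracy([True, False, True, True], [True, False, False, True]))
    print(line_recall({10, 11, 12}, {11, 12, 13, 14}))
```

Separating the task list from the scoring functions keeps the sketch easy to adapt: a different metric can be swapped in per task without touching the task definitions.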