Abstract: As software becomes increasingly complex and prone to vulnerabilities, automated vulnerability detection is critically important, yet challenging. Given the significant successes of large language models (LLMs) in various tasks, there is growing anticipation of their efficacy in vulnerability detection. However, a quantitative understanding of their potential in vulnerability detection is still missing. To bridge this gap, we introduce a comprehensive vulnerability benchmark VulBench. This benchmark aggregates high-quality data from a wide range of CTF (Capture-the-Flag) challenges and real-world applications, with annotations for each vulnerable function detailing the vulnerability type and its root cause. Through our experiments encompassing 16 LLMs and 6 state-of-the-art (SOTA) deep learning-based models and static analyzers, we find that several LLMs outperform traditional deep learning approaches in vulnerability detection, revealing an untapped potential in LLMs. This work contributes to the understanding and utilization of LLMs for enhanced software security.

What problem does this paper attempt to address?

The paper evaluates how effective large language models (LLMs) are at software vulnerability detection and how much potential they have for the task. Specifically, it aims to:

1. **Quantitatively evaluate the performance of LLMs**: Using a comprehensive vulnerability benchmark dataset, VulBench, the paper runs large-scale experiments on 16 LLMs and 6 state-of-the-art deep learning models and static analysis tools to measure their vulnerability detection performance.
2. **Fill a gap in existing research**: Although LLMs have shown strong capabilities in many fields, they have not yet been evaluated quantitatively enough for vulnerability detection. The paper's detailed experimental results address this gap.
3. **Improve the quality and accuracy of datasets**: Existing vulnerability datasets are often low in quality and inaccurate, which leads to low detection accuracy. The paper therefore constructs VulBench, a high-quality dataset that covers vulnerabilities from CTF challenges and real-world applications and annotates each one with its vulnerability type and root cause.
4. **Explore the untapped potential of LLMs in vulnerability detection**: The experiments show that some LLMs outperform traditional deep learning methods in vulnerability detection, revealing untapped potential in this field.

### Main contributions of the paper

- **First large-scale study**: Quantitatively measures the performance of 16 LLMs in vulnerability detection and compares them with state-of-the-art deep learning models and static analysis tools.
- **Introduction of the VulBench dataset**: Addresses the quality problems of existing datasets with a more accurate and comprehensive dataset for evaluating vulnerability detection models, including a natural-language description of each vulnerability.
- **Reveal the untapped potential of LLMs**: The results offer new insights and directions for future research and demonstrate the potential advantages of LLMs in vulnerability detection.
- **Publish the dataset publicly**: To promote future research, the paper releases the VulBench dataset on GitHub.

### Summary

By constructing a high-quality dataset and conducting large-scale experiments, the paper systematically evaluates the performance of LLMs in vulnerability detection, reveals their untapped potential, and lays a foundation for future research. This improves our understanding of how LLMs apply to software security and opens new directions for automated vulnerability detection.
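To make the evaluation setup described above more concrete, the following is a minimal Python sketch of how one annotated benchmark entry and a binary detection score could be represented. Everything in it is an assumption for illustration: the class name `BenchmarkEntry`, its field names, and the choice of precision, recall, and F1 are hypothetical and are not taken from the VulBench release or from the paper's scoring protocol.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkEntry:
    """Hypothetical record for one annotated function (not the real VulBench schema)."""
    function_source: str
    is_vulnerable: bool
    vuln_type: Optional[str] = None   # e.g. "buffer overflow"; None if not vulnerable
    root_cause: Optional[str] = None  # natural-language explanation of the flaw


def binary_scores(labels, predictions):
    """Return (precision, recall, f1) for the 'vulnerable' class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)        # vulnerable, flagged
    fp = sum(1 for y, p in pairs if not y and p)    # safe, but flagged
    fn = sum(1 for y, p in pairs if y and not p)    # vulnerable, but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: two vulnerable and two safe functions (made up for this sketch).
entries = [
    BenchmarkEntry("void f(char *s) { char b[8]; strcpy(b, s); }", True,
                   "buffer overflow", "unbounded copy into a fixed-size buffer"),
    BenchmarkEntry("void g(char *p) { free(p); free(p); }", True,
                   "double free", "pointer released twice"),
    BenchmarkEntry("int add(int a, int b) { return a + b; }", False),
    BenchmarkEntry("int zero(void) { return 0; }", False),
]

# Hypothetical model verdicts, one per entry, in the same order.
model_predictions = [True, False, True, False]

labels = [e.is_vulnerable for e in entries]
scores = binary_scores(labels, model_predictions)
print(scores)
```

In this toy run the model catches one of the two vulnerable functions and raises one false alarm, so precision, recall, and F1 all come out to 0.5. Scoring whether a model also names the correct vulnerability type would need a separate comparison against `vuln_type`, which this sketch leaves out.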

How Far Have We Gone in Vulnerability Detection Using Large Language Models

VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models

Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study

Large Language Model for Vulnerability Detection: Emerging Results and Future Directions

Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities

Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection

VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching

An Empirical Study of Automated Vulnerability Localization with Large Language Models

LLbezpeky: Leveraging Large Language Models for Vulnerability Detection

Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study

Large Language Model for Vulnerability Detection and Repair: Literature Review and the Road Ahead

Multitask-based Evaluation of Open-Source LLM on Software Vulnerability

Code Vulnerability Detection: A Comparative Analysis of Emerging Large Language Models

Comparison of Static Application Security Testing Tools and Large Language Models for Repo-level Vulnerability Detection

Automated Software Vulnerability Patching using Large Language Models

Recent Advances in Attack and Defense Approaches of Large Language Models

LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning

Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning

Exploring Vulnerabilities and Protections in Large Language Models: A Survey

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models