Abstract: While automated vulnerability detection techniques have made promising progress in detecting security vulnerabilities, their scalability and applicability remain challenging. The remarkable performance of Large Language Models (LLMs), such as GPT-4 and CodeLlama, on code-related tasks has prompted recent works to explore whether LLMs can be used to detect vulnerabilities. In this paper, we perform a more comprehensive study by concurrently examining a larger number of datasets, languages, and LLMs, and by qualitatively evaluating performance across prompts and vulnerability classes while addressing the shortcomings of existing tools. Concretely, we evaluate the effectiveness of 16 pre-trained LLMs on 5,000 code samples from five diverse security datasets. These balanced datasets encompass both synthetic and real-world projects in Java and C/C++ and cover 25 distinct vulnerability classes. Overall, LLMs across all scales and families show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and an F1 score of 0.71 across datasets. They are significantly better at detecting vulnerabilities that require only intra-procedural analysis, such as OS Command Injection and NULL Pointer Dereference. Moreover, they report higher accuracies on these vulnerabilities than popular static analysis tools, such as CodeQL. We find that advanced prompting strategies that involve step-by-step analysis significantly improve the performance of LLMs on real-world datasets in terms of F1 score (by up to 0.18 on average). Interestingly, we observe that LLMs show promising abilities at performing parts of the analysis correctly, such as identifying vulnerability-related specifications and leveraging natural language information to understand code behavior (e.g., to check if code is sanitized). We expect our insights to guide future work on LLM-augmented vulnerability detection systems.

Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study

Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities

How Far Have We Gone in Vulnerability Detection Using Large Language Models

Can Large Language Models Find And Fix Vulnerable Software?

Code Vulnerability Detection: A Comparative Analysis of Emerging Large Language Models

Large Language Model for Vulnerability Detection: Emerging Results and Future Directions

Large Language Models and Code Security: A Systematic Literature Review

A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection

How Well Do Large Language Models Serve as End-to-End Secure Code Producers?

VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study

Vulnerability Detection in Popular Programming Languages with Language Models

A Preliminary Study on Using Large Language Models in Software Pentesting

An Empirical Study of Automated Vulnerability Localization with Large Language Models

Software Vulnerability and Functionality Assessment using LLMs

Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs

How secure is AI-generated Code: A Large-Scale Comparison of Large Language Models

An Insight into Security Code Review with LLMs: Capabilities, Obstacles and Influential Factors

Beyond Static Tools: Evaluating Large Language Models for Cryptographic Misuse Detection