Abstract: While automated vulnerability detection techniques have made promising progress in detecting security vulnerabilities, their scalability and applicability remain challenging. The remarkable performance of Large Language Models (LLMs), such as GPT-4 and CodeLlama, on code-related tasks has prompted recent works to explore whether LLMs can be used to detect vulnerabilities. In this paper, we perform a more comprehensive study by concurrently examining a larger number of datasets, languages, and LLMs, and qualitatively evaluating performance across prompts and vulnerability classes while addressing the shortcomings of existing tools. Concretely, we evaluate the effectiveness of 16 pre-trained LLMs on 5,000 code samples from five diverse security datasets. These balanced datasets encompass both synthetic and real-world projects in Java and C/C++ and cover 25 distinct vulnerability classes. Overall, LLMs across all scales and families show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and an F1 score of 0.71 across datasets. They are significantly better at detecting vulnerabilities that require only intra-procedural analysis, such as OS Command Injection and NULL Pointer Dereference. Moreover, they report higher accuracies on these vulnerabilities than popular static analysis tools, such as CodeQL. We find that advanced prompting strategies that involve step-by-step analysis significantly improve the performance of LLMs on real-world datasets in terms of F1 score (by up to 0.18 on average). Interestingly, we observe that LLMs show promising abilities at performing parts of the analysis correctly, such as identifying vulnerability-related specifications and leveraging natural language information to understand code behavior (e.g., to check if code is sanitized). We expect our insights to guide future work on LLM-augmented vulnerability detection systems.

Assessing Large Language Model’s knowledge of threat behavior in MITRE ATT&CK

The Use of Large Language Models (LLM) for Cyber Threat Intelligence (CTI) in Cybercrime Forums

On the Uses of Large Language Models to Interpret Ambiguous Cyberattack Descriptions

LLMs Killed the Script Kiddie: How Agents Supported by Large Language Models Change the Landscape of Network Threat Testing

Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

From Text to MITRE Techniques: Exploring the Malicious Use of Large Language Models for Generating Cyber Attack Payloads

Getting pwn'd by AI: Penetration Testing with Large Language Models

CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge

Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)

An Empirical Evaluation of LLMs for Solving Offensive Security Challenges

Advancing TTP Analysis: Harnessing the Power of Large Language Models with Retrieval Augmented Generation

Actionable Cyber Threat Intelligence using Knowledge Graphs and Large Language Models

Assessment of LLM Responses to End-user Security Questions

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities

A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models

A Preliminary Study on Using Large Language Models in Software Pentesting

Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements

Large Language Model for Vulnerability Detection: Emerging Results and Future Directions