Abstract: Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. We thus develop SecLLMHolmes, a fully automated evaluation framework that performs the most detailed investigation to date on whether LLMs can reliably identify and reason about security-related bugs. We construct a set of 228 code scenarios and analyze eight of the most capable LLMs across eight different investigative dimensions using our framework. Our evaluation shows LLMs provide non-deterministic responses, incorrect and unfaithful reasoning, and perform poorly in real-world scenarios. Most importantly, our findings reveal significant non-robustness in even the most advanced models like 'PaLM2' and 'GPT-4': by merely changing function or variable names, or by the addition of library functions in the source code, these models can yield incorrect answers in 26% and 17% of cases, respectively. These findings demonstrate that further LLM advances are needed before LLMs can be used as general purpose security assistants.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: **Can large language models (LLMs) be reliable assistants for identifying and reasoning about security vulnerabilities?** Specifically, the paper explores this problem in the following aspects:

1. **Research Background and Motivation**:
   - LLMs perform well in programming tasks such as code generation, documentation writing, and debugging, and have been proposed for automated vulnerability repair. However, benchmarks showing that they can consistently identify security-related bugs are lacking.
   - Developers often miss security-related issues during development and lack the means to detect and fix them early.
2. **Research Objectives**:
   - Evaluate whether LLMs can reliably identify and reason about security-related vulnerabilities.
   - Provide a comprehensive framework (SecLLMHolmes) for automatically evaluating the performance of LLMs in vulnerability detection.
3. **Research Methods**:
   - Constructed a dataset of 228 code scenarios covering 8 key vulnerability types in C and Python.
   - Used the SecLLMHolmes framework to evaluate eight state-of-the-art LLMs along eight dimensions: determinism of responses, performance across a range of parameters, prompt diversity, faithfulness of reasoning, coverage of multiple vulnerability types, code of varying difficulty, robustness to code augmentation, and application to real-world projects.
4. **Research Findings**:
   - LLMs are unreliable at identifying security vulnerabilities: their responses are non-deterministic, and their reasoning is often incorrect or unfaithful.
   - Even the most advanced models (such as PaLM2 and GPT-4) give wrong answers in 26% and 17% of cases, respectively, when function or variable names are changed or library functions are added to the source code.
   - LLMs perform poorly on real-world projects and fail to detect vulnerabilities effectively.
5. **Conclusion**:
   - Current LLMs are not yet ready to serve as general-purpose security assistants for automatically detecting vulnerabilities.
   - Further advances in LLMs are required to improve their reliability and accuracy in security-related tasks.

In summary, this paper evaluates the capabilities of LLMs in security vulnerability detection, reveals their current shortcomings, and provides a benchmark and directions for future research.
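The robustness finding (a verdict changing when only function or variable names change) can be illustrated with a minimal Python sketch. This is not the SecLLMHolmes implementation: `rename_identifiers`, `flip_rate`, `toy_detector`, and the four scenarios are hypothetical names and data introduced here, and the toy detector is a deliberately naive stand-in for an LLM that keys on suggestive names rather than on semantics.

```python
import re
from typing import Callable, Dict, List, Tuple


def rename_identifiers(code: str, mapping: Dict[str, str]) -> str:
    """Rename whole-word identifiers (assumes the new names are not already in use)."""
    if not mapping:
        return code
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], code)


def flip_rate(
    detector: Callable[[str], bool],
    scenarios: List[Tuple[str, Dict[str, str]]],
) -> float:
    """Fraction of scenarios whose verdict changes after the renaming."""
    flipped = 0
    for code, mapping in scenarios:
        if detector(code) != detector(rename_identifiers(code, mapping)):
            flipped += 1
    return flipped / len(scenarios)


def toy_detector(code: str) -> bool:
    """Hypothetical stand-in for an LLM verdict: flags strcpy only near a name containing 'buf'."""
    return "strcpy" in code and "buf" in code


# Hypothetical scenarios: (code snippet, renaming to apply).
SCENARIOS = [
    ("char buf[8]; strcpy(buf, user_input);", {"buf": "x", "user_input": "y"}),
    ("char buf[8]; strcpy(buf, src);", {"src": "input"}),
    ("int total = a + b;", {"total": "sum"}),
    ("char name[8]; strcpy(name, src);", {"name": "label"}),
]

if __name__ == "__main__":
    print(f"flip rate: {flip_rate(toy_detector, SCENARIOS):.2f}")
```

In a real evaluation, `toy_detector` would be replaced by a call to the model under test, and the reported percentage would be the flip rate (or error rate) over the augmented scenarios.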

LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks
