Abstract: Security code review is a time-consuming and labor-intensive process typically requiring integration with automated security defect detection tools. However, existing security analysis tools struggle with poor generalization, high false positive rates, and coarse detection granularity. Large Language Models (LLMs) have been considered promising candidates for addressing those challenges. In this study, we conducted an empirical study to explore the potential of LLMs in detecting security defects during code review. Specifically, we evaluated the performance of six LLMs under five different prompts and compared them with state-of-the-art static analysis tools. We also performed linguistic and regression analyses for the best-performing LLM to identify quality problems in its responses and factors influencing its performance. Our findings show that: (1) existing pre-trained LLMs have limited capability in security code review but significantly outperform the state-of-the-art static analysis tools. (2) GPT-4 performs best among all LLMs when provided with a CWE list for reference. (3) GPT-4 frequently generates responses that are verbose or not compliant with the task requirements given in the prompts. (4) GPT-4 is more adept at identifying security defects in code files with fewer tokens, containing functional logic, or written by developers with less involvement in the project.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper explores the potential of large language models (LLMs) in security code review as a way to address the shortcomings of existing security analysis tools. Specifically, the paper focuses on the following aspects:

1. **Problems with Existing Security Analysis Tools**:
   - **Poor Generalization**: Existing tools struggle to adapt to different types of code.
   - **High False Positive Rate**: Tools frequently report non-existent security flaws.
   - **Coarse Detection Granularity**: The detection results are not detailed enough, making it difficult to pinpoint the exact location of a flaw.
2. **Potential Applications of LLMs**:
   - **Improving Detection Accuracy**: LLMs are expected to detect security flaws in code more accurately.
   - **Reducing False Positive Rate**: LLMs may reduce false positives and improve the reliability of detection.
   - **Providing Detailed Information**: LLMs can generate detailed explanations that help developers understand the specifics of a security flaw.
3. **Research Methods**:
   - **Experimental Design**: The paper evaluates six LLMs under five different prompts and compares them with state-of-the-art static analysis tools.
   - **Performance Evaluation**: Several evaluation metrics, such as I-Score, IH-Score, and M-Score, are used to measure the performance of LLMs in security code review.
   - **Influence Factor Analysis**: Linguistic and regression analyses of the best-performing LLM identify quality issues in its responses and factors affecting its performance.

### Main Findings

1. **Existing pre-trained LLMs have limited capability in security code review but significantly outperform state-of-the-art static analysis tools**.
2. **GPT-4 performs best among all LLMs when provided with a CWE list for reference**.
3. **GPT-4 frequently generates responses that are verbose or do not comply with the task requirements given in the prompts**.
4. **GPT-4 is better at identifying security flaws in code files that have fewer tokens, contain functional logic, or are written by developers with less involvement in the project**.

### Contributions

1. **A fine-grained evaluation of currently popular LLMs in security code review**.
2. **Measurement of the impact of LLMs' randomness on the consistency of security code review**.
3. **Identification of quality issues in responses generated by top LLMs, revealing challenges in security code review**.
4. **First analysis of factors affecting LLM performance in security code review**.

Through this research, the paper provides insights and recommendations for applying LLMs to security code review, with the goal of improving the efficiency and quality of code review.
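The second finding above (a CWE list supplied for reference improves results) can be illustrated with a minimal sketch of how such a prompt might be assembled. Everything here is an assumption for illustration: the function name `build_review_prompt`, the prompt wording, and the three-entry `EXAMPLE_CWES` list are hypothetical and are not the prompts or the CWE list used in the paper.

```python
from typing import Dict, Optional

# Hypothetical reference list for illustration only; the paper's actual
# CWE list is not reproduced here.
EXAMPLE_CWES: Dict[str, str] = {
    "CWE-20": "Improper Input Validation",
    "CWE-79": "Cross-site Scripting",
    "CWE-89": "SQL Injection",
}


def build_review_prompt(code: str, cwe_list: Optional[Dict[str, str]] = None) -> str:
    """Assemble a security-review prompt, optionally with a CWE reference list."""
    parts = ["Review the following code for security defects."]
    if cwe_list:
        # Append the reference list so the model can map findings to CWE entries.
        parts.append("Use this CWE list as a reference:")
        parts.extend(f"- {cwe_id}: {name}" for cwe_id, name in sorted(cwe_list.items()))
    parts.append("Code under review:")
    parts.append(code)
    return "\n".join(parts)


if __name__ == "__main__":
    snippet = 'query = "SELECT * FROM users WHERE id = " + user_id'
    print(build_review_prompt(snippet, EXAMPLE_CWES))
```

Calling `build_review_prompt(snippet)` without the second argument yields the baseline prompt with no reference list, which is one simple way to compare the two conditions on the same code.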

An Insight into Security Code Review with LLMs: Capabilities, Obstacles and Influential Factors

Security Code Review by LLMs: A Deep Dive into Responses

Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities

How Well Do Large Language Models Serve as End-to-End Secure Code Producers?

What You See Is Not Always What You Get: An Empirical Study of Code Comprehension by Large Language Models

Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs

Software Vulnerability and Functionality Assessment using LLMs

LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

Occasionally Secure: A Comparative Analysis of Code Generation Assistants

Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study

A survey on Large Language Model (LLM) security and privacy: The Good, The Bad, and The Ugly

Security Attacks on LLM-based Code Completion Tools

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

Robustness, Security, Privacy, Explainability, Efficiency, and Usability of Large Language Models for Code

LMs: Understanding Code Syntax and Semantics for Code Analysis

Exploring Advanced Methodologies in Security Evaluation for LLMs

A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems

Large Language Models and Code Security: A Systematic Literature Review