An Insight into Security Code Review with LLMs: Capabilities, Obstacles and Influential Factors

Jiaxin Yu, Peng Liang, Yujia Fu, Amjed Tahir, Mojtaba Shahin, Chong Wang, Yangxiao Cai
2024-10-05
Abstract: Security code review is a time-consuming and labor-intensive process typically requiring integration with automated security defect detection tools. However, existing security analysis tools struggle with poor generalization, high false positive rates, and coarse detection granularity. Large Language Models (LLMs) have been considered promising candidates for addressing those challenges. In this study, we conducted an empirical study to explore the potential of LLMs in detecting security defects during code review. Specifically, we evaluated the performance of six LLMs under five different prompts and compared them with state-of-the-art static analysis tools. We also performed linguistic and regression analyses for the best-performing LLM to identify quality problems in its responses and factors influencing its performance. Our findings show that: (1) existing pre-trained LLMs have limited capability in security code review but significantly outperform the state-of-the-art static analysis tools. (2) GPT-4 performs best among all LLMs when provided with a CWE list for reference. (3) GPT-4 frequently generates responses that are verbose or not compliant with the task requirements given in the prompts. (4) GPT-4 is more adept at identifying security defects in code files with fewer tokens, containing functional logic, or written by developers with less involvement in the project.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper explores whether large language models (LLMs) can support security code review and address the shortcomings of existing security analysis tools. Specifically, the paper focuses on the following aspects:

1. **Problems with Existing Security Analysis Tools**:
   - **Poor Generalization**: Existing tools struggle to adapt to different types of code.
   - **High False Positive Rate**: Tools frequently report security flaws that do not exist.
   - **Coarse Detection Granularity**: The detection results are not detailed enough, making it difficult to pinpoint the exact location of a flaw.
2. **Potential Applications of LLMs**:
   - **Improving Detection Accuracy**: LLMs are expected to detect security flaws in code more accurately.
   - **Reducing False Positive Rate**: LLMs may be able to reduce false positives and improve the reliability of detection.
   - **Providing Detailed Information**: LLMs can generate detailed explanations that help developers understand the specifics of a security flaw.
3. **Research Methods**:
   - **Experimental Design**: The paper evaluates the performance of six LLMs under five different prompts and compares them with state-of-the-art static analysis tools.
   - **Performance Evaluation**: Several evaluation metrics, such as I-Score, IH-Score, and M-Score, are used to measure the performance of LLMs in security code review.
   - **Influence Factor Analysis**: Linguistic and regression analyses of the best-performing LLM are conducted to identify quality problems in its responses and factors affecting its performance.

### Main Findings

1. **Existing pre-trained LLMs have limited capability in security code review but significantly outperform state-of-the-art static analysis tools**.
2. **GPT-4 performs best among all LLMs when provided with a CWE list for reference**.
3. **GPT-4 frequently generates responses that are verbose or do not comply with the task requirements given in the prompts**.
4. **GPT-4 is better at identifying security defects in code files that contain fewer tokens, contain functional logic, or are written by developers with less involvement in the project**.

### Contributions

1. **A fine-grained evaluation of currently popular LLMs in security code review**.
2. **Measurement of the impact of LLMs' randomness on the consistency of security code review**.
3. **Identification of quality issues in responses generated by the best-performing LLM, revealing challenges in security code review**.
4. **First analysis of factors affecting LLM performance in security code review**.

Through this research, the paper provides insights and recommendations for applying LLMs to security code review, with the aim of improving the efficiency and quality of code review.
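To make the "CWE list for reference" finding more concrete, the following is a minimal sketch of how a review prompt that includes a CWE reference list might be assembled. It is an illustration only: the function `build_review_prompt`, the prompt wording, and the three CWE entries chosen here are assumptions for this example and are not the prompts or the CWE list used in the paper.

```python
from typing import Mapping

# Hypothetical reference list for illustration; the paper's actual CWE list
# is not reproduced here.
CWE_REFERENCE = {
    "CWE-79": "Cross-site Scripting",
    "CWE-89": "SQL Injection",
    "CWE-798": "Use of Hard-coded Credentials",
}


def build_review_prompt(code: str, cwe_reference: Mapping[str, str]) -> str:
    """Assemble a security-review prompt that lists CWE entries for reference.

    The wording below is an assumed example, not the paper's prompt.
    """
    # One bullet per CWE entry, in the order given by the mapping.
    cwe_lines = "\n".join(
        f"- {cwe_id}: {name}" for cwe_id, name in cwe_reference.items()
    )
    return (
        "Review the following code for security defects.\n"
        "Use this CWE list as a reference when naming defect types:\n"
        f"{cwe_lines}\n\n"
        "For each defect, give the line, the CWE ID, and a short explanation.\n\n"
        "Code under review:\n"
        f"{code}\n"
    )


if __name__ == "__main__":
    snippet = 'query = "SELECT * FROM users WHERE id = " + user_id'
    print(build_review_prompt(snippet, CWE_REFERENCE))
```

The resulting string could be sent to any LLM client. Keeping the reference list as a separate argument makes it straightforward to compare a prompt with the list against one without it, which is the kind of prompt variation the study describes.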