A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection

Benjamin Steenhoek, Md Mahbubur Rahman, Monoshi Kumar Roy, Mirza Sanjida Alam, Earl T. Barr, Wei Le
2024-03-26
Abstract: Large Language Models (LLMs) have demonstrated great potential for code generation and other software engineering tasks. Vulnerability detection is of crucial importance to maintaining the security, integrity, and trustworthiness of software systems. Precise vulnerability detection requires reasoning about the code, making it a good case study for exploring the limits of LLMs' reasoning capabilities. Although recent work has applied LLMs to vulnerability detection using generic prompting techniques, their full capabilities for this task and the types of errors they make when explaining identified vulnerabilities remain unclear.
Software Engineering, Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper explores the capabilities of large language models (LLMs) for vulnerability detection. Specifically, it focuses on the following aspects:

1. **Evaluating the capabilities of existing LLMs**: The researchers surveyed 11 state-of-the-art LLMs and assessed their performance on vulnerability detection tasks.
2. **Optimizing prompting methods**: They systematically searched for the best-performing prompts, including techniques such as in-context learning and chain-of-thought, and proposed three new prompting methods.
3. **Analyzing error types**: They analyzed and categorized 287 instances of model reasoning to reveal the errors LLMs commonly make during vulnerability detection.
4. **Comparing with human performance**: They compared the vulnerability localization capabilities of LLMs with those of professional software engineers to understand the models' strengths and weaknesses.

### Main Contributions

1. **Designed three new prompt templates**: These templates combine information from vulnerability-fixing patches, CVE descriptions, and static analyzers.
2. **Comprehensively evaluated state-of-the-art LLMs**: Their performance on vulnerability detection was tested, including on pairs of vulnerable and fixed code.
3. **Analyzed model errors in detail**: The error types in 287 LLM responses were analyzed, yielding a dataset for future research.
4. **Compared with human performance**: The debugging capabilities of LLMs were evaluated on the DbgBench benchmark and compared with those of human developers.

### Research Findings

- **Effectiveness of prompt methods**: Basic prompts and in-context prompts with random examples performed best, ranking first for 4 models. In-context examples chosen by embedding similarity and chain-of-thought from static analysis (CoT-SA) performed best for 3 to 4 models.
- **Model performance**: Despite their strong performance on other tasks, LLMs performed poorly at vulnerability detection, with balanced accuracy ranging from 0.5 to 0.63, close to the 0.5 baseline of random guessing. On average, models failed to distinguish between vulnerable and fixed versions in 76% of cases.
- **Error types**: 57% of LLM responses contained errors, mainly in code understanding, hallucination, logic, and common-sense knowledge. LLMs particularly struggled to reason correctly about bounds and null checks.
- **Comparison with humans**: On the DbgBench benchmark, LLMs correctly located only 6 out of 27 bugs, all of which were also correctly diagnosed by at least one human participant. GPT-3 performed best, correctly locating 4 of them.

### Conclusion

Although LLMs show great potential on other tasks, their performance on vulnerability detection, which requires complex reasoning, remains unsatisfactory. The findings highlight the need for further research to improve the vulnerability detection capabilities of LLMs.
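
The two headline metrics in the findings, balanced accuracy and the rate at which a model fails to distinguish a vulnerable function from its fixed version, can be made concrete with a small sketch. This is an illustration only, not the paper's evaluation code: the function names, the toy labels, and the reading of "fails to distinguish" as "assigns the same label to both versions of a pair" are assumptions.

```python
from typing import Sequence, Tuple


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of the recall on the vulnerable class (1) and the non-vulnerable class (0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    return 0.5 * (tp / pos + tn / neg)


def undistinguished_rate(pair_preds: Sequence[Tuple[int, int]]) -> float:
    """Fraction of (vulnerable, fixed) pairs given the same label for both versions.

    Assumption: a pair counts as "not distinguished" when the prediction
    does not change between the vulnerable and the fixed version.
    """
    if not pair_preds:
        raise ValueError("at least one pair is required")
    same = sum(1 for vuln_pred, fixed_pred in pair_preds if vuln_pred == fixed_pred)
    return same / len(pair_preds)


# Hypothetical toy data, not taken from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
always_vulnerable = [1] * len(y_true)
# A model that flags everything scores no better than chance.
print(balanced_accuracy(y_true, always_vulnerable))  # 0.5

# Each tuple is (prediction on vulnerable version, prediction on fixed version).
pairs = [(1, 1), (0, 0), (1, 0), (1, 1)]
print(undistinguished_rate(pairs))  # 0.75
```

A predictor that always answers "vulnerable" reaches a balanced accuracy of exactly 0.5, which is why the reported range of 0.5 to 0.63 sits so close to guessing. The pairwise rate is a stricter check: a model only gets credit on a pair when its answer changes once the vulnerability is fixed.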