To Err is Machine: Vulnerability Detection Challenges LLM Reasoning

Benjamin Steenhoek, Md Mahbubur Rahman, Monoshi Kumar Roy, Mirza Sanjida Alam, Hengbo Tong, Swarna Das, Earl T. Barr, Wei Le
2025-01-08
Abstract: In this paper, we present a challenging code reasoning task: vulnerability detection. Large Language Models (LLMs) have shown promising results in natural-language and math reasoning, but state-of-the-art (SOTA) models reported only 54.5% Balanced Accuracy in our vulnerability detection evaluation, even those models pre-trained on large amounts of source code. Our error analysis on LLM responses shows that the models struggle to reason about the code semantics relevant to identifying vulnerabilities, especially subtle semantic differences caused by small textual changes. We explored prominent models and training settings to understand their effects on vulnerability detection performance -- including better prompts, larger models, more pre-training data, and fine-tuning -- but none led to significant improvements. This raises the question of whether simply scaling training data and model size will allow us to "solve" complex code reasoning tasks like vulnerability detection, or if a fundamental shift in modeling and training techniques is required. We also explored adding domain knowledge to prompts; although it helped certain models understand some code semantics, vulnerability detection requires multi-step reasoning, and these models still failed in steps, such as reasoning about variable relations. Our results suggest that new models, new training methods, or more execution-specific pretraining data may be needed to conquer vulnerability detection. We speculate that auto-regressive pre-training on source code may not effectively extract code semantics, especially on the current pretraining mixtures, in which execution data is scarce. Success on vulnerability detection as a code reasoning task can benefit many areas of software engineering such as debugging, test input generation, and program repair.
Our code and data are available at https://doi.org/10.6084/m9.figshare.27368025.
Software Engineering, Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
The problem that this paper attempts to solve is: **How to effectively detect vulnerabilities in code using large language models (LLMs)**. Specifically, the paper points out that although existing large language models perform well in natural language processing, mathematical reasoning, and code-generation tasks, they perform poorly in identifying code vulnerabilities. The paper experimentally evaluated the performance of 14 state-of-the-art large language models (SOTA LLMs) on the vulnerability-detection task and found that the balanced accuracy of these models was only 50-55%, close to the level of random guessing. Even models pre-trained on a large amount of source code did not perform significantly better on the vulnerability-detection task.

### Main problems:

1. **Complex code-reasoning challenges**: Vulnerability detection requires not only multi-step analysis but also an accurate understanding of code semantics. For example, to identify vulnerabilities such as buffer overflow (BOF) or null-pointer dereference (NPD), details such as variable relationships, boundary checks, string operations, and pointer operations need to be understood.
2. **Limitations of existing models**: The research found that even increasing the model size, using more training data, or fine-tuning could not significantly improve the model's performance on the vulnerability-detection task. This indicates that simply expanding the amount of training data and the model size may not be sufficient to solve this complex task, and fundamental changes in modeling and training methods may be required.
3. **Scarcity of execution data**: Current autoregressive pre-training methods may not be able to effectively extract execution semantics from code text, especially when there is a lack of sufficient execution data in the pre-training data. This makes it difficult for the model to understand the actual running behavior of the code.
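To make the "close to random guessing" reading of the 50-55% figure concrete, the following is a minimal sketch of balanced accuracy, assuming the standard definition (the mean of the recall on the vulnerable class and the recall on the non-vulnerable class). The `confusion_t` type and the `balanced_accuracy` function are hypothetical names introduced for illustration, not taken from the paper's artifact.

```c
#include <stddef.h>

/* Confusion-matrix counts for a binary vulnerable / not-vulnerable task.
 * Hypothetical type, introduced only for this illustration. */
typedef struct {
    size_t tp; /* vulnerable, predicted vulnerable */
    size_t fn; /* vulnerable, predicted not vulnerable */
    size_t tn; /* not vulnerable, predicted not vulnerable */
    size_t fp; /* not vulnerable, predicted vulnerable */
} confusion_t;

/* Balanced accuracy = (true positive rate + true negative rate) / 2.
 * Returns -1.0 when either class has no examples, since the metric
 * is undefined in that case. */
double balanced_accuracy(confusion_t c) {
    size_t pos = c.tp + c.fn;
    size_t neg = c.tn + c.fp;
    if (pos == 0 || neg == 0) {
        return -1.0;
    }
    double tpr = (double)c.tp / (double)pos;
    double tnr = (double)c.tn / (double)neg;
    return 0.5 * (tpr + tnr);
}
```

Under this definition, a model that labels every function as vulnerable gets a true positive rate of 1 and a true negative rate of 0, so it scores exactly 0.5 regardless of class imbalance. That is why a balanced accuracy in the low 50s indicates little signal beyond chance.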
### Main contributions of the paper:

1. **Clarifying vulnerability detection as a complex reasoning challenge**: The paper analyzes in detail the multi-step reasoning process required for vulnerability detection and points out the deficiencies of existing models in this regard.
2. **Revealing the performance bottlenecks of existing models**: Through manual analysis of hundreds of model responses, the paper reveals that the models have difficulties in all reasoning stages, especially in understanding semantics involving boundary/NULL checks, string operations, and pointer handling.
3. **Exploring improvement directions**: The paper explores ways to mitigate certain types of errors by adding domain knowledge, improving prompt methods, etc., but points out that these improvements have not significantly improved overall performance. Therefore, the paper suggests that future research should focus on new model architectures, training methods, and more specific execution-data pre-training.

In conclusion, this paper aims to reveal the limitations of existing large language models in the vulnerability-detection task and provide directions for future improvements.
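The boundary and NULL check semantics mentioned above can be illustrated with a small pair of C functions that differ by only a few characters. This is a hypothetical sketch written for this summary, not an example drawn from the paper's dataset; the names `copy_name_vulnerable`, `copy_name_fixed`, and `NAME_CAP` are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

#define NAME_CAP 16 /* hypothetical buffer capacity, including the terminator */

/* Vulnerable variant (illustrative only).
 * 1. No NULL check: strlen(src) dereferences a NULL pointer (NPD).
 * 2. Off-by-one bound: len == NAME_CAP passes the check, so the
 *    terminator is written to dst[NAME_CAP], one byte past the
 *    end of the buffer (BOF). */
int copy_name_vulnerable(char *dst, const char *src) {
    size_t len = strlen(src);
    if (len > NAME_CAP) {
        return -1;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 0;
}

/* Patched variant: adds the NULL checks and changes '>' to '>=' so
 * that there is always room for the terminating byte. */
int copy_name_fixed(char *dst, const char *src) {
    if (dst == NULL || src == NULL) {
        return -1;
    }
    size_t len = strlen(src);
    if (len >= NAME_CAP) {
        return -1;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 0;
}
```

Telling the two variants apart requires relating `len` to the buffer capacity and tracking that `dst[len]` is written after the check, which is the kind of multi-step reasoning about variable relations that the summary says the models struggle with.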