Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection

Yuejun Guo, Constantinos Patsakis, Qiang Hu, Qiang Tang, Fran Casino
2024-08-29
Abstract: The significant increase in software production driven by automation and faster development lifecycles has resulted in a corresponding surge in software vulnerabilities. In parallel, the evolving landscape of software vulnerability detection, highlighting the shift from traditional methods to machine learning and large language models (LLMs), provides massive opportunities at the cost of resource-demanding computations. This paper thoroughly analyses LLMs' capabilities in detecting vulnerabilities within source code by testing models beyond their usual applications to study their potential in cybersecurity tasks. We evaluate the performance of six open-source models that are specifically trained for vulnerability detection against six general-purpose LLMs, three of which were further fine-tuned on a dataset that we compiled. Our dataset, alongside five state-of-the-art benchmark datasets, was used to create a pipeline to leverage a binary classification task, namely classifying code into vulnerable and non-vulnerable. The findings highlight significant variations in classification accuracy across benchmarks, revealing the critical influence of fine-tuning in enhancing the detection capabilities of small LLMs over their larger counterparts, yet only in the specific scenarios in which they were trained. Further experiments and analysis also underscore the issues with current benchmark datasets, particularly around mislabeling and their impact on model training and performance, which raises concerns about the current state of practice. We also discuss the road ahead in the field, suggesting strategies for improved model training and dataset curation.
Cryptography and Security
What problem does this paper attempt to address?
This paper evaluates the capabilities of large language models (LLMs) in software vulnerability detection, especially in settings beyond their usual range of application. Specifically, the researchers test these models to understand their potential in cybersecurity tasks and explore whether strategies such as fine-tuning can improve the detection performance of small LLMs, enabling them to outperform larger LLMs in specific scenarios.

### Main problems and goals

1. **Evaluating the capabilities of LLMs**:
   - Investigate whether LLMs can detect vulnerabilities in source code.
   - Compare the performance of open-source models specifically trained for vulnerability detection with that of general-purpose LLMs.
2. **The impact of fine-tuning**:
   - Explore the effect of fine-tuning on the detection accuracy of small LLMs.
   - Analyze whether fine-tuned small LLMs can outperform larger LLMs in specific situations.
3. **Quality issues of existing datasets**:
   - Evaluate the label accuracy of existing benchmark datasets and its impact on model training and performance.
   - Reveal problems in current practice, such as mislabeling.
4. **Future development directions**:
   - Propose strategies for improving model training and dataset construction.
   - Discuss how to further improve the performance and reliability of LLMs in software vulnerability detection.

### Experimental design

To achieve the above goals, the researchers conducted the following experiments:

- **Dataset selection**: Six datasets were used: one compiled by the authors and five existing benchmark datasets.
- **Model selection**: Six open-source models specifically trained for vulnerability detection and six general-purpose LLMs were selected for comparison.
- **Fine-tuning experiment**: Three of the six general-purpose LLMs were further fine-tuned on the authors' dataset to observe how their performance changed.
- **Evaluation metrics**: Precision, recall, and F1-score were used as evaluation metrics.

### Results and discussion

The research results show that:

- Model performance varies significantly across datasets.
- Fine-tuning can significantly improve the detection performance of small LLMs, but only in the specific scenarios in which they were trained, which indicates a loss in generalization ability.
- Existing benchmark datasets contain inaccurate labels, which may affect both training and the measured performance of the models.

### Conclusion

This study reveals the potential and challenges of LLMs in software vulnerability detection and offers suggestions for improving model training and dataset quality. Future research needs to further explore how to improve the generalization ability and detection accuracy of LLMs, especially in diverse practical application scenarios.

### Formula representation

The formulas involved in the paper are represented in Markdown format as follows:

- Precision: \( P = \frac{TP}{TP + FP} \)
- Recall: \( R = \frac{TP}{TP + FN} \)
- F1-score: \( F1 = 2 \times \frac{P \times R}{P + R} \)

where \( TP \) is the number of true positives, \( FP \) the number of false positives, and \( FN \) the number of false negatives.
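
The precision, recall, and F1-score formulas can be illustrated with a minimal Python sketch for the binary task of labelling code as vulnerable (1) or non-vulnerable (0). This is not the authors' evaluation code: the function name `binary_metrics`, the toy labels, and the convention of returning 0.0 when a denominator is zero are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                   positive: int = 1) -> BinaryMetrics:
    """Compute P, R and F1 for one positive class (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Assumption: report 0.0 instead of raising when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return BinaryMetrics(precision, recall, f1)


# Toy labels, invented for illustration: 1 = vulnerable, 0 = non-vulnerable.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy example there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all equal 0.75. On class-imbalanced vulnerability datasets the three values typically diverge, so each of them carries information that the others do not.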