Code Vulnerability Detection: A Comparative Analysis of Emerging Large Language Models

Shaznin Sultana, Sadia Afreen, Nasir U. Eisty
2024-09-17
Abstract: The growing trend of vulnerability issues in software development as a result of a large dependence on open-source projects has received considerable attention recently. This paper investigates the effectiveness of Large Language Models (LLMs) in identifying vulnerabilities within codebases, with a focus on the latest advancements in LLM technology. Through a comparative analysis, we assess the performance of emerging LLMs, specifically Llama, CodeLlama, Gemma, and CodeGemma, alongside established state-of-the-art models such as BERT, RoBERTa, and GPT-3. Our study aims to shed light on the capabilities of LLMs in vulnerability detection, contributing to the enhancement of software security practices across diverse open-source repositories. We observe that CodeGemma achieves the highest F1-score of 58% and a Recall of 87%, amongst the recent additions of large language models to detect software security vulnerabilities.
Software Engineering
What problem does this paper attempt to address?
The main problem this paper addresses is evaluating how effectively large language models (LLMs) detect code vulnerabilities, especially in open-source projects. Specifically, the goals of the paper include:

1. **Compare the performance of emerging large language models with existing advanced models**: Evaluate new models such as Llama, CodeLlama, Gemma, and CodeGemma against existing advanced models such as BERT, RoBERTa, and GPT-3 on code-vulnerability detection.
2. **Explore the differences between natural-language models and code-specific models**: Examine the performance differences between natural-language-based LLMs (such as Llama 2, Gemma) and code-specific LLMs (such as CodeLlama, CodeGemma) on the vulnerability-detection task.
3. **Verify the potential of the latest LLMs in vulnerability detection**: Test experimentally whether these new models can match or exceed traditional methods and existing deep-learning models in the field of software security.
4. **Provide insights for improving software-security practices**: Use the evaluation of different LLMs to offer suggestions and guidance for improving the software security of open-source projects.

### Specific Problems and Goals of the Paper

#### Research Question 1 (RQ1)

- **How effective are emerging large language models in detecting code vulnerabilities?**
  - The models under study include recently introduced LLMs (such as Llama 2, Gemma, CodeLlama, CodeGemma).

#### Research Question 2 (RQ2)

- **Can natural-language-based LLMs outperform code-based LLMs?**
  - Compare the performance of natural-language LLMs (such as Llama 2, Gemma) and code-specific LLMs (such as CodeLlama, CodeGemma).
#### Research Question 3 (RQ3)

- **How do these findings compare with existing advanced models?**
  - Compare the results with existing advanced models (such as BERT, RoBERTa, GPT-3).

#### Research Question 4 (RQ4)

- **What are the advantages and disadvantages of new LLMs compared with existing models?**
  - Analyze the performance of new LLMs in practical applications and discuss their advantages and limitations.

### Method Overview

The paper uses the DiverseVul dataset, which contains a large number of C/C++ code fragments with and without vulnerabilities. To ensure a fair comparison, the authors pre-processed the data by cleaning it and balancing the class distribution, and used prompt-engineering techniques to optimize the model input. The selected LLMs were then fine-tuned for the code-vulnerability-detection task.

### Results and Conclusions

The experiments lead to the following main conclusions:

- **CodeGemma** performs well in recall and F1-score, reaching 87% and 58% respectively, but is weaker in precision.
- **Llama 2** reaches the highest overall accuracy, 65%, but its other metrics fall short of CodeGemma's.
- **Gemma** also reaches 65% accuracy, but is slightly lower in recall and F1-score.
- **CodeLlama** and the older **GPT-2 Base** remain competitive in some cases, especially in terms of accuracy.

In general, this study shows the potential of emerging LLMs in code-vulnerability detection, but also points out some of their limitations and challenges in practical applications. Future research can further optimize these models for use in actual software-security scenarios.
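
The CodeGemma figures reported above (recall 87%, F1-score 58%) give no precision value. Since the F1-score is the harmonic mean of precision and recall, a rough precision can be derived from the two reported numbers. The following is a minimal Python sketch of that arithmetic, not a result taken from the paper. It assumes the standard definition F1 = 2PR / (P + R) and treats the rounded percentages as exact; the function names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_from_f1_and_recall(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("no valid precision: F1 must be smaller than 2 * recall")
    return f1 * recall / denominator


# Reported CodeGemma values, assumed exact although they are rounded.
reported_f1 = 0.58
reported_recall = 0.87

implied_precision = precision_from_f1_and_recall(reported_f1, reported_recall)
print(f"Implied precision: {implied_precision:.3f}")

# Sanity check: plugging the result back in recovers the reported F1.
print(f"Recomputed F1:     {f1_score(implied_precision, reported_recall):.3f}")
```

Under these assumptions the implied precision is roughly 0.435, which fits the summary's remark that CodeGemma trades precision for recall. Because the inputs are rounded to whole percentages, the true value may differ by a point or so.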