Harnessing the Power of LLMs in Source Code Vulnerability Detection

Andrew A Mahyari
2024-08-07
Abstract: Software vulnerabilities, caused by unintentional flaws in source code, are a primary root cause of cyberattacks. Static analysis of source code has been widely used to detect these unintentional defects introduced by software developers. Large Language Models (LLMs) have demonstrated human-like conversational abilities due to their capacity to capture complex patterns in sequential data, such as natural languages. In this paper, we harness LLMs' capabilities to analyze source code and detect known vulnerabilities. To ensure the proposed vulnerability detection method is universal across multiple programming languages, we convert source code to LLVM IR and train LLMs on these intermediate representations. We conduct extensive experiments on various LLM architectures and compare their accuracy. Our comprehensive experiments on real-world and synthetic codes from NVD and SARD demonstrate high accuracy in identifying source code vulnerabilities.
Software Engineering, Artificial Intelligence, Cryptography and Security
What problem does this paper attempt to address?
This paper addresses vulnerability detection in software source code. Vulnerabilities caused by unintentional defects in source code are a primary source of cyberattacks and can lead to serious social and economic losses. Traditional static analysis is widely used to detect these defects, but it has limitations: it often cannot pinpoint the specific vulnerable lines, and it does not generalize across programming languages. To overcome these problems, the paper proposes a method based on large language models (LLMs) that analyzes source code and detects known vulnerabilities. The source code is converted into LLVM intermediate representation (IR), and LLMs are trained on that representation, with the aim of providing a single vulnerability detection approach that works across multiple programming languages. The reported experiments show high detection accuracy on real-world and synthetic code datasets.

The key steps are:

1. **Source code conversion**: Convert source code from different programming languages into LLVM IR, so that the method is language-independent.
2. **Feature extraction**: Extract syntactic and semantic features from the LLVM IR to generate intermediate representations (iSeVCs).
3. **Model training**: Use a custom tokenizer to convert the intermediate representations into unique identifiers, then train LLMs for vulnerability detection.
4. **Performance evaluation**: Verify the effectiveness of the proposed method through comparative experiments against existing methods, such as VulDeeLocator and LSTM-based methods.

The goal of this research is to use the capabilities of LLMs to improve both the accuracy and the universality of source code vulnerability detection.
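To make the tokenization step more concrete, the following is a minimal sketch of how LLVM IR text might be mapped to integer identifiers before being fed to a model. It is not the paper's tokenizer: the class name `IRTokenizer`, the token pattern `TOKEN_RE`, and the `<pad>`/`<unk>` special tokens are all hypothetical choices made for illustration. It also assumes the IR is already available as text, for example from `clang -S -emit-llvm`.

```python
import re

# Hypothetical token pattern for textual LLVM IR (an assumption, not the paper's rules).
TOKEN_RE = re.compile(
    r"""
    [%@][\w.]+          # local and global identifiers
    | \d+               # integer literals
    | [A-Za-z_][\w.]*   # opcodes, types, keywords such as add, i32, align
    | [^\s\w]           # any single punctuation character
    """,
    re.VERBOSE,
)


class IRTokenizer:
    """Maps LLVM IR tokens to unique integer ids (illustrative sketch only)."""

    def __init__(self):
        # Reserve id 0 for padding and id 1 for tokens unseen during fit().
        self.vocab = {"<pad>": 0, "<unk>": 1}

    def tokenize(self, ir_text):
        # Split IR text into tokens; whitespace is discarded.
        return TOKEN_RE.findall(ir_text)

    def fit(self, corpus):
        # Assign the next free id to every new token, in order of first appearance.
        for text in corpus:
            for tok in self.tokenize(text):
                self.vocab.setdefault(tok, len(self.vocab))
        return self

    def encode(self, ir_text, max_len=None):
        # Look up each token id, falling back to <unk> for unknown tokens.
        unk = self.vocab["<unk>"]
        ids = [self.vocab.get(tok, unk) for tok in self.tokenize(ir_text)]
        if max_len is not None:
            # Truncate, then right-pad with <pad> up to max_len.
            ids = ids[:max_len]
            ids += [self.vocab["<pad>"]] * (max_len - len(ids))
        return ids


sample = "%3 = add nsw i32 %1, %2"
tokenizer = IRTokenizer().fit([sample])
print(tokenizer.tokenize(sample))  # ['%3', '=', 'add', 'nsw', 'i32', '%1', ',', '%2']
print(tokenizer.encode(sample, max_len=10))
```

A fixed vocabulary with an explicit unknown token keeps the mapping deterministic across languages that compile to the same IR. A real pipeline would likely also normalize user-defined names before building the vocabulary, which this sketch does not attempt.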