Detect Llama -- Finding Vulnerabilities in Smart Contracts using Large Language Models

Peter Ince, Xiapu Luo, Jiangshan Yu, Joseph K. Liu, Xiaoning Du
2024-07-12
Abstract: In this paper, we test the hypothesis that although OpenAI's GPT-4 performs well generally, we can fine-tune open-source models to outperform GPT-4 in smart contract vulnerability detection. We fine-tune two models from Meta's Code Llama on a dataset of 17k prompts, Detect Llama - Foundation and Detect Llama - Instruct, and we also fine-tune OpenAI's GPT-3.5 Turbo model (GPT-3.5FT). We then evaluate these models, plus a random baseline, against GPT-4 and GPT-4 Turbo on a test set we develop, measuring weighted F1 scores for detection of eight vulnerabilities from the dataset and for the two top identified vulnerabilities. We find that for binary classification (i.e., is this smart contract vulnerable?), our two best-performing models, GPT-3.5FT and Detect Llama - Foundation, achieve F1 scores of $0.776$ and $0.68$, outperforming both GPT-4 and GPT-4 Turbo ($0.66$ and $0.675$). For individual vulnerability identification, our top two models, GPT-3.5FT and Detect Llama - Foundation, both significantly outperform GPT-4 and GPT-4 Turbo in weighted F1 for all vulnerabilities ($0.61$ and $0.56$ respectively, against GPT-4's $0.218$ and GPT-4 Turbo's $0.243$) and in weighted F1 for the top two identified vulnerabilities ($0.719$ for GPT-3.5FT and $0.674$ for Detect Llama - Foundation, against GPT-4's $0.363$ and GPT-4 Turbo's $0.429$).
Cryptography and Security
What problem does this paper attempt to address?
The paper addresses vulnerability detection in smart contracts. Specifically, the researchers test whether fine-tuned open-source large language models (LLMs) can outperform OpenAI's GPT-4 at this task. Although GPT-4 performs well on many tasks, they hypothesize that fine-tuning code-specific open-source models can yield better smart contract vulnerability detection.

### Research Background and Motivation

With the growth of decentralized finance (DeFi), a large amount of capital is locked in smart contracts. This has attracted malicious actors and led to multiple attacks on smart contracts, which highlights the importance of detecting vulnerabilities quickly and accurately. Several kinds of automated detection tools exist, but each has limitations:

- **Static-analysis tools**: They are fast but prone to false positives.
- **Dynamic-analysis tools**: They are highly accurate but slow.

The researchers therefore aim for a tool that combines the advantages of both: fast detection with fewer false positives.

### Research Methods

The researchers fine-tuned Meta's Code Llama model on a dataset of 17,000 prompts. They also fine-tuned OpenAI's GPT-3.5 Turbo model and created a random baseline for comparison. Finally, they evaluated all of these models, alongside GPT-4 and GPT-4 Turbo, on a custom test set.

### Main Contributions

1. **Release of open-source model**: The researchers released the fine-tuned Code Llama 34b model as a smart contract vulnerability detection tool.
2. **Evaluation of GPT-3.5 Turbo**: The researchers evaluated the fine-tuned GPT-3.5 Turbo as a smart contract vulnerability detection tool.
3. **Comparison with GPT-4 and GPT-4 Turbo**: The researchers showed that fine-tuned open-source models and a fine-tuned GPT-3.5 Turbo can significantly outperform GPT-4 and GPT-4 Turbo on specific detection tasks.
4. **Publication of dataset and training set**: The researchers made their open-source model and the prompt set used for training and evaluation public, so that future research can build on them.

### Experimental Results

The fine-tuned GPT-3.5 Turbo performs best in the binary classification task (i.e., whether a smart contract is vulnerable), with an F1 score of 0.776. The model based on Code Llama 34b Foundation follows with an F1 score of 0.68. Both exceed GPT-4 (0.66) and GPT-4 Turbo (0.675).

### Conclusion

By fine-tuning code-specific open-source large language models, the researchers improved smart contract vulnerability detection beyond the performance of GPT-4 and GPT-4 Turbo. This suggests a direction for future smart contract security tools.
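
To make the reported metrics concrete, the following is a minimal sketch of how a binary F1 score and a support-weighted F1 score over several vulnerability classes could be computed. It is not the authors' evaluation code. It assumes each contract carries a set of vulnerability labels, that a contract counts as "vulnerable" when that set is non-empty, and that per-class F1 scores are weighted by the number of true instances of each class. The labels `vuln_a` and `vuln_b`, the toy data, and all function names are hypothetical.

```python
# Illustrative sketch only: label names, data, and helper names are assumptions,
# not taken from the paper.

def f1_from_counts(tp, fp, fn):
    """F1 = harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def class_counts(true_sets, pred_sets, label):
    """Count true positives, false positives, and false negatives for one label."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        if label in truth and label in pred:
            tp += 1
        elif label in pred:
            fp += 1
        elif label in truth:
            fn += 1
    return tp, fp, fn


def weighted_f1(true_sets, pred_sets, labels):
    """Average per-class F1, weighting each class by its number of true instances."""
    supports = {label: sum(label in t for t in true_sets) for label in labels}
    total = sum(supports.values())
    if total == 0:
        return 0.0
    return sum(
        f1_from_counts(*class_counts(true_sets, pred_sets, label)) * supports[label] / total
        for label in labels
    )


def binary_f1(true_sets, pred_sets):
    """F1 for 'is this contract vulnerable?', where vulnerable means a non-empty label set."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        if truth and pred:
            tp += 1
        elif pred:
            fp += 1
        elif truth:
            fn += 1
    return f1_from_counts(tp, fp, fn)


# Hypothetical ground truth and model predictions for four contracts.
LABELS = ["vuln_a", "vuln_b"]
TRUE = [{"vuln_a"}, {"vuln_a", "vuln_b"}, set(), {"vuln_b"}]
PRED = [{"vuln_a"}, {"vuln_a"}, {"vuln_b"}, {"vuln_b"}]

print(weighted_f1(TRUE, PRED, LABELS))  # 0.75
print(round(binary_f1(TRUE, PRED), 3))  # 0.857
```

In this toy data the model flags every contract as vulnerable, so its binary F1 is fairly high even though it misattributes `vuln_b` half the time. That illustrates why a model's binary score can sit well above its per-vulnerability weighted F1, as in the results reported above.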