Abstract: In this paper, we test the hypothesis that, although OpenAI's GPT-4 performs well in general, open-source models can be fine-tuned to outperform it in smart contract vulnerability detection. We fine-tune two models from Meta's Code Llama on a dataset of 17k prompts, Detect Llama - Foundation and Detect Llama - Instruct, and we also fine-tune OpenAI's GPT-3.5 Turbo model (GPT-3.5FT). We then evaluate these models, plus a random baseline, against GPT-4 and GPT-4 Turbo on a test set we develop, comparing their detection of eight vulnerabilities from the dataset and of the two top identified vulnerabilities, as measured by weighted F1 score. We find that for binary classification (i.e., is this smart contract vulnerable?), our two best-performing models, GPT-3.5FT and Detect Llama - Foundation, achieve F1 scores of $0.776$ and $0.68$, outperforming both GPT-4 and GPT-4 Turbo ($0.66$ and $0.675$ respectively). For individual vulnerability identification, the same two models significantly outperform GPT-4 and GPT-4 Turbo both in weighted F1 for all vulnerabilities ($0.61$ for GPT-3.5FT and $0.56$ for Detect Llama - Foundation, against GPT-4's $0.218$ and GPT-4 Turbo's $0.243$) and in weighted F1 for the top two identified vulnerabilities ($0.719$ for GPT-3.5FT and $0.674$ for Detect Llama - Foundation, against GPT-4's $0.363$ and GPT-4 Turbo's $0.429$).
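To make the weighted F1 metric concrete, here is a minimal sketch of one common way to compute it: per-class F1 averaged with weights equal to each class's support (the number of contracts that truly carry that vulnerability). This is an assumption about the averaging scheme, since the abstract does not spell out the exact procedure. The label names and toy data below are hypothetical and are not taken from the paper's dataset.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred, labels):
    """Support-weighted average of per-class F1 for multi-label predictions.

    y_true and y_pred are lists of sets, one set of vulnerability labels
    per contract. Each class is weighted by its support (TP + FN).
    """
    weighted_sum = 0.0
    total_support = 0
    for label in labels:
        tp = fp = fn = 0
        for truth, pred in zip(y_true, y_pred):
            if label in pred and label in truth:
                tp += 1
            elif label in pred:
                fp += 1
            elif label in truth:
                fn += 1
        support = tp + fn
        weighted_sum += support * f1_score(tp, fp, fn)
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


# Hypothetical labels and toy data, for illustration only.
labels = ["reentrancy", "unchecked-calls"]
y_true = [{"reentrancy"}, {"reentrancy", "unchecked-calls"}, set(), {"unchecked-calls"}]
y_pred = [{"reentrancy"}, {"reentrancy"}, {"unchecked-calls"}, {"unchecked-calls"}]

score = weighted_f1(y_true, y_pred, labels)
print(f"weighted F1 = {score:.3f}")
```

In this toy example, the first class is predicted perfectly (F1 of 1.0) and the second has one miss and one false alarm (F1 of 0.5). Both have a support of 2, so the weighted score is 0.75.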

What problem does this paper attempt to address?

This paper addresses vulnerability detection in smart contracts. Specifically, the researchers test whether fine-tuned open-source large language models (LLMs) can outperform OpenAI's GPT-4 at smart contract vulnerability detection. Although GPT-4 performs well on many tasks, the researchers hypothesize that fine-tuning code-specific open-source models can improve detection performance.

### Research Background and Motivation

With the growth of decentralized finance (DeFi), a large amount of capital is locked in smart contracts. This has attracted malicious actors and led to multiple attacks on smart contracts, which highlight the importance of detecting smart contract vulnerabilities quickly and accurately. Several kinds of automated detection tools exist, but each has its limitations:

- **Static analysis tools**: They are fast but prone to false positives.
- **Dynamic analysis tools**: They are highly accurate but slow.

The researchers therefore aim to develop a tool that combines the advantages of both approaches, detecting vulnerabilities quickly while also reducing false positives.

### Research Methods

To achieve this goal, the researchers chose Meta's Code Llama model and fine-tuned it on a dataset of 17,000 prompts. They also fine-tuned OpenAI's GPT-3.5 Turbo model and created a random baseline for comparison. Finally, they evaluated all of these models on a custom test set.

### Main Contributions

1. **Release of an open-source model**: The researchers released the fine-tuned Code Llama 34B model as a smart contract vulnerability detection tool.
2. **Evaluation of GPT-3.5 Turbo**: The researchers evaluated the performance of GPT-3.5 Turbo as a smart contract vulnerability detection tool.
3. **Comparison with GPT-4 and GPT-4 Turbo**: The researchers showed that fine-tuned open-source models and a fine-tuned GPT-3.5 Turbo can significantly outperform GPT-4 and GPT-4 Turbo on specific detection tasks.
4. **Publication of the model and prompt set**: The researchers made their open-source model and the prompt set used for training and evaluation public, so that future research can build on them.

### Experimental Results

The fine-tuned GPT-3.5 Turbo performs best in the binary classification task (i.e., whether a smart contract has a vulnerability), with an F1 score of 0.776. The model based on Code Llama 34B Foundation also performs well, with an F1 score of 0.68. Both exceed GPT-4 (0.66) and GPT-4 Turbo (0.675).

### Conclusion

By fine-tuning code-specific open-source LLMs and GPT-3.5 Turbo, the researchers improved smart contract vulnerability detection beyond the performance of GPT-4 and GPT-4 Turbo on the evaluated tasks. This suggests a new direction for the development of smart contract security tools.
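The binary classification comparison against a random baseline can be sketched as follows. This is an illustration only: the summary does not say how the paper's random baseline was constructed, so the coin-flip classifier, its `p_vulnerable` parameter, and the synthetic labels below are all assumptions.

```python
import random


def random_baseline(n_contracts, p_vulnerable=0.5, seed=0):
    """Flag each contract as vulnerable with probability p_vulnerable (assumed scheme)."""
    rng = random.Random(seed)  # seeded so the baseline is reproducible
    return [rng.random() < p_vulnerable for _ in range(n_contracts)]


def binary_f1(y_true, y_pred):
    """F1 score for the positive ('vulnerable') class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p and not t)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Synthetic ground truth: half of 1000 contracts are vulnerable (not real data).
y_true = [i % 2 == 0 for i in range(1000)]
y_pred = random_baseline(len(y_true))

baseline_f1 = binary_f1(y_true, y_pred)
print(f"random baseline F1 = {baseline_f1:.3f}")
```

Under these assumptions, a coin-flip baseline on a balanced test set lands at an F1 of roughly 0.5. The actual floor depends on the share of vulnerable contracts in the test set and on how the baseline is defined.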

Detect Llama -- Finding Vulnerabilities in Smart Contracts using Large Language Models
