Evaluation of ChatGPT's Smart Contract Auditing Capabilities Based on Chain of Thought

Yuying Du, Xueyan Tang
2024-02-19
Abstract: Smart contracts, as a key component of blockchain technology, play a crucial role in ensuring the automation of transactions and adherence to protocol rules. However, smart contracts are susceptible to security vulnerabilities, which, if exploited, can lead to significant asset losses. This study explores the potential of enhancing smart contract security audits using the GPT-4 model. We used a dataset of 35 smart contracts from the SolidiFI-benchmark vulnerability library, containing 732 vulnerabilities, and compared GPT-4 with five other vulnerability detection tools to evaluate its ability to identify seven common types of vulnerabilities. Moreover, we assessed GPT-4's performance in code parsing and vulnerability capture by simulating a professional auditor's auditing process using CoT (Chain of Thought) prompts based on the audit reports of eight groups of smart contracts. We also evaluated GPT-4's ability to write Solidity Proofs of Concept (PoCs). Through experimentation, we found that GPT-4 performed poorly in detecting smart contract vulnerabilities, with a high Precision of 96.6% but a low Recall of 37.8% and an F1-score of 41.1%, indicating a tendency to miss vulnerabilities during detection. Meanwhile, it demonstrated good contract code parsing capabilities, with an average comprehensive score of 6.5, and was capable of identifying the background information and functional relationships of smart contracts; in 60% of the cases, it could write usable PoCs, suggesting GPT-4 has significant potential application in PoC writing. These experimental results indicate that GPT-4 lacks the ability to detect smart contract vulnerabilities effectively, but its performance in contract code parsing and PoC writing demonstrates its significant potential as an auxiliary tool in enhancing the efficiency and effectiveness of smart contract security audits.
Artificial Intelligence, Cryptography and Security
What problem does this paper attempt to address?
This paper evaluates the capabilities of large language models (LLMs), especially GPT-4, in smart contract auditing. Specifically, the research explores GPT-4's capabilities in the following aspects:

1. **Detecting smart contract vulnerabilities**: Evaluate GPT-4's performance in identifying seven common types of vulnerabilities in smart contracts (such as overflow/underflow, reentrancy attacks, and timestamp dependence).
2. **Parsing smart contract code**: Test whether GPT-4 can accurately understand and analyze the business logic, background information, and code structure of smart contracts.
3. **Writing proofs of concept (PoCs) for vulnerability verification**: Examine GPT-4's ability to write Solidity PoCs that verify the security vulnerabilities it identifies.

### Specific problem description

As a key component of blockchain technology, smart contracts ensure the automated execution of transactions and compliance with protocol rules. However, smart contracts are susceptible to security vulnerabilities that, once exploited, may lead to significant asset losses. In recent years, multiple smart contract attack incidents have highlighted the importance of smart contract security auditing.

The research evaluates GPT-4's capabilities in smart contract auditing in the following ways:

- **Dataset selection**: Use 35 smart contracts from the SolidiFI-benchmark dataset, containing 732 known vulnerabilities, for testing.
- **Experimental design**:
  - **Vulnerability detection**: Compare GPT-4's performance with five other vulnerability detection tools and evaluate its ability to detect seven common vulnerability types.
  - **Code parsing**: Based on the audit reports of eight groups of smart contracts, simulate the audit process of a professional auditor using Chain of Thought (CoT) prompts to evaluate GPT-4's code parsing and vulnerability identification abilities.
  - **PoC writing**: Test GPT-4's performance in writing PoCs, including its ability to generate PoCs with and without prompts.

### Experimental results

GPT-4 achieves a high Precision (96.6%) in vulnerability detection but a low Recall (37.8%) and an F1-score of 41.1%, indicating that it tends to miss vulnerabilities (under-reporting). Nevertheless, GPT-4 shows potential in code parsing, with an average comprehensive score of 6.5, and in PoC writing, where it generates usable PoCs in 60% of cases.

### Conclusion

Although GPT-4 performs poorly in smart contract vulnerability detection, it has significant application potential in code parsing and PoC writing and can serve as an auxiliary tool to improve the efficiency and effectiveness of smart contract security auditing. These results help in understanding and developing applications of artificial intelligence in smart contract security and provide a reference for formulating network security technologies and strategies.

### Related formulas

- **Precision**: \[ \text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}} \]
- **Recall**: \[ \text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}} \]
- **Accuracy**: \[ \text{Accuracy}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{FP}+\text{FN}+\text{TN}} \]
- **F1-score**: \[ \text{F1-score}=2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}} \]

where TP denotes true positives, FP false positives, FN false negatives, and TN true negatives.
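
The four metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the `Counts` class, the function names, and the example counts are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts for one detector (hypothetical container)."""
    tp: int      # vulnerabilities correctly reported
    fp: int      # reports that are not real vulnerabilities
    fn: int      # real vulnerabilities that were missed
    tn: int = 0  # non-vulnerable items correctly left unreported


def precision(c: Counts) -> float:
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Counts) -> float:
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def accuracy(c: Counts) -> float:
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


def f1_from(p: float, r: float) -> float:
    """Harmonic mean of a precision value and a recall value."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def f1_score(c: Counts) -> float:
    return f1_from(precision(c), recall(c))


if __name__ == "__main__":
    # Illustrative counts only; these are NOT taken from the paper.
    example = Counts(tp=30, fp=10, fn=20, tn=40)
    print(f"precision = {precision(example):.3f}")
    print(f"recall    = {recall(example):.3f}")
    print(f"accuracy  = {accuracy(example):.3f}")
    print(f"f1        = {f1_score(example):.3f}")
```

Applying `f1_from(0.966, 0.378)` to the reported aggregate Precision and Recall gives roughly 0.543 rather than the reported 41.1%. This summary does not say how the F1-score was aggregated; one possible explanation, stated here only as an assumption, is that it was averaged over per-vulnerability-type scores instead of being computed from the aggregate Precision and Recall.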