Abstract: Semantic code search technology allows searching for existing code snippets through natural language, which can greatly improve programming efficiency. Smart contracts, programs that run on the blockchain, have a code reuse rate of more than 90%, which means developers have a great demand for semantic code search tools. However, existing code search models still have a semantic gap between code and query, and perform poorly on specialized queries of smart contracts. In this paper, we propose a Multi-Modal Smart contract Code Search (MM-SCS) model. Specifically, we construct a Contract Elements Dependency Graph (CEDG) for MM-SCS as an additional modality to capture the data-flow and control-flow information of the code. To make the model more focused on the key contextual information, we use a multi-head attention network to generate embeddings for code features. In addition, we use a fine-tuned pretrained model to ensure the model's effectiveness when the training data is small. We compared MM-SCS with four state-of-the-art models on a dataset with 470K (code, docstring) pairs collected from GitHub and Etherscan. Experimental results show that MM-SCS achieves an MRR (Mean Reciprocal Rank) of 0.572, outperforming the four state-of-the-art models UNIF, DeepCS, CARLCS-CNN, and TAB-CS by 34.2%, 59.3%, 36.8%, and 14.1%, respectively. Additionally, the search speed of MM-SCS is second only to UNIF, reaching 0.34s/query.
What problem does this paper attempt to address?
This paper addresses two main challenges in semantic code search for smart contracts:
1. **Semantic Gap**: Existing code search models leave a large semantic gap between queries and code, and they perform poorly on queries specific to the smart contract domain. Smart contract code has distinctive structures and characteristics (such as control flow and data flow) that existing models fail to capture fully, which leads to inaccurate search results.
2. **Insufficient Training Data**: Most advanced neural code search models require a large number of (code, query) pairs as training data. However, for blockchain applications, and especially for smart contracts written in Solidity, no sufficiently large public corpus is available for training, which limits the applicability of these models.
To solve these problems, the authors propose a multi-modal smart contract code search model (Multi-Modal Smart contract Code Search, MM-SCS). Specifically, MM-SCS improves smart contract code search in the following three ways:
1. **Extra Modality**:
- The authors propose the Contract Elements Dependency Graph (CEDG), which integrates control-flow and data-flow information into a single graph to better capture code structure and dependencies. CEDG both simplifies the graph structure and highlights the dependencies between code elements, which helps the model learn key semantic features.
2. **Code Embedding Mechanisms**:
- Multi-head self-attention networks embed the three text modalities (code tokens, function names, and API sequences), and a modified graph attention network embeds the CEDG. With multi-head self-attention, different heads can learn different features and the important ones are emphasized.
3. **Pretrained Model**:
- A fine-tuned ALBERT model serves as the query encoder to cope with limited training data. Because ALBERT has been pre-trained on a large-scale corpus, it can achieve better performance with only a small amount of training data.
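To make the idea of a single graph carrying both control-flow and data-flow dependencies more concrete, here is a minimal sketch in Python. It is not the paper's CEDG construction: the class `DependencyGraph`, the edge labels `"control"` and `"data"`, and the toy `withdraw` example are all assumptions made for illustration, and the actual node and edge types of CEDG are not described in this summary.

```python
from collections import defaultdict


class DependencyGraph:
    """Toy directed graph whose edges are labelled 'control' or 'data'.

    Hypothetical illustration only; not the CEDG definition from the paper.
    """

    EDGE_KINDS = ("control", "data")

    def __init__(self):
        # node -> list of (neighbor, edge kind), in insertion order
        self.edges = defaultdict(list)

    def add_edge(self, src, dst, kind):
        if kind not in self.EDGE_KINDS:
            raise ValueError(f"unknown edge kind: {kind!r}")
        self.edges[src].append((dst, kind))

    def neighbors(self, node, kind=None):
        """Return successors of `node`, optionally filtered by edge kind."""
        return [
            dst
            for dst, k in self.edges.get(node, [])
            if kind is None or k == kind
        ]


# Assumed toy contract function, written out by hand rather than parsed:
#   function withdraw(uint amount) {
#       require(balances[msg.sender] >= amount);
#       msg.sender.transfer(amount);
#   }
graph = DependencyGraph()
graph.add_edge("withdraw", "require", "control")   # function entry -> check
graph.add_edge("require", "transfer", "control")   # check -> payout
graph.add_edge("amount", "require", "data")        # amount is read by the check
graph.add_edge("balances", "require", "data")      # balances is read by the check
graph.add_edge("amount", "transfer", "data")       # amount is passed to transfer

data_uses_of_amount = graph.neighbors("amount", kind="data")
control_successors = graph.neighbors("require", kind="control")
```

Keeping both edge kinds in one structure means a graph encoder can see, for example, that `amount` feeds both the check and the transfer while the check also precedes the transfer in execution order.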
With these improvements, MM-SCS significantly outperforms four existing advanced models (UNIF, DeepCS, CARLCS-CNN, and TAB-CS) in the experiments. It achieves a Mean Reciprocal Rank (MRR) of 0.572, an improvement of 34.2%, 59.3%, 36.8%, and 14.1% over these models, respectively. In addition, MM-SCS answers a query in 0.34 seconds, a search speed second only to that of UNIF.
In conclusion, this paper aims to solve the semantic gap and insufficient training data problems in smart contract code search by introducing a new graph representation, improved embedding mechanisms, and a pretrained query encoder, thereby improving search accuracy and efficiency.
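For readers unfamiliar with the metric, MRR averages, over all queries, the reciprocal of the rank at which the first correct result appears. The sketch below uses the standard definition; the function name `mean_reciprocal_rank` and the example ranks are made up for illustration and are not data from the paper.

```python
def mean_reciprocal_rank(ranks):
    """Compute MRR from 1-based ranks of the first correct result per query.

    A rank of None means the correct snippet was not retrieved at all,
    which contributes 0 to the sum.
    """
    if not ranks:
        raise ValueError("ranks must contain at least one query")
    total = sum(1.0 / r for r in ranks if r is not None)
    return total / len(ranks)


# Illustrative ranks for four hypothetical queries (not results from the paper):
# correct snippet found at rank 1, 2, 4, and not found.
example_ranks = [1, 2, 4, None]
mrr = mean_reciprocal_rank(example_ranks)  # (1 + 0.5 + 0.25 + 0) / 4 = 0.4375
```

An MRR closer to 1 means the correct snippet tends to appear nearer the top of the result list.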