A Novel Source Code Representation Approach Based on Multi-Head Attention

Lei Xiao, Hao Zhong, Jianjian Liu, Kaiyu Zhang, Qizhen Xu, Le Chang
DOI: https://doi.org/10.3390/electronics13112111
IF: 2.9
2024-05-30
Electronics
Abstract: Code classification and code clone detection are crucial for understanding and maintaining large software systems. Although deep learning surpasses traditional techniques in capturing the features of source code, existing models suffer from low processing power and high complexity. We propose a novel source code representation method based on the multi-head attention mechanism (SCRMHA). SCRMHA captures the vector representation of entire code segments, enabling it to focus on different positions of the input sequence, capture richer semantic information, and simultaneously process different aspects and relationships of the sequence. Moreover, it can calculate multiple attention heads in parallel, speeding up the computational process. We evaluate SCRMHA on both the standard dataset and an actual industrial dataset, and analyze the differences between these two datasets. Experiment results in code classification and clone detection tasks show that SCRMHA consumes less time and reduces complexity by about one-third compared with traditional source code feature representation methods. The results demonstrate that SCRMHA reduces the computational complexity and time consumption of the model while maintaining accuracy.
engineering, electrical & electronic; computer science, information systems; physics, applied
What problem does this paper attempt to address?
This paper proposes SCRMHA (Source Code Representation based on Multi-Head Attention), a source code representation method built on the multi-head attention mechanism. The motivation is that code classification and code clone detection are important for understanding and maintaining large software systems, yet existing methods struggle with complexity and efficiency. The authors observe that although deep learning has surpassed traditional techniques in capturing source code features, existing models have high computational complexity and limited parallel processing capability on large-scale datasets.

To address these issues, SCRMHA uses multi-head attention to process code snippets in parallel. The mechanism attends to different positions of the input sequence, captures richer semantic information, and handles several aspects and relationships of the sequence at once, which speeds up computation and lowers complexity. SCRMHA first converts a code snippet into an Abstract Syntax Tree (AST), then splits the AST into a sequence of small statement trees, encodes each statement tree into a vector with a statement encoder, and finally applies multi-head attention to extract features of the entire code snippet.

Experimental results on code classification and clone detection show that SCRMHA takes less time and reduces complexity by about one-third compared with traditional source code representation methods, while maintaining accuracy. The paper also reviews related work, including methods based on text, token sequences, ASTs, program dependence graphs (PDGs), and control flow graphs (CFGs), as well as recent applications of neural networks to code representation. The authors compare SCRMHA with traditional methods on a standard dataset and a real-world industrial dataset, and analyze the differences between the two.

In summary, the main contributions of this paper are:
1. The proposal of SCRMHA, a source code representation method based on multi-head attention, which improves training and inference speed on large-scale datasets and reduces model complexity.
2. The implementation of SCRMHA, which makes effective use of parallel computing resources and overcomes the limited parallel processing capability of existing models.
3. The evaluation of SCRMHA on a standard dataset and a real-world industrial dataset, demonstrating higher time efficiency and lower model complexity while maintaining accuracy.
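
The pipeline summarized above (AST, statement sequence, statement vectors, multi-head attention, snippet vector) can be illustrated with a small sketch. This is not the authors' implementation: it uses Python's own `ast` module as a stand-in parser, a toy hashed bag-of-node-types in place of the learned statement encoder, identity projections instead of learned query/key/value weights, and mean pooling at the end. The function names `statement_sequence`, `encode_statement`, `multi_head_self_attention`, and `represent` are all hypothetical.

```python
import ast
import math
import random


def statement_sequence(node):
    """Flatten an AST into its statement nodes, in source (preorder) order."""
    seq = []
    for child in ast.iter_child_nodes(node):
        if isinstance(child, ast.stmt):
            seq.append(child)
        seq.extend(statement_sequence(child))
    return seq


def node_types(stmt):
    """Node type names of one statement tree, excluding nested statements."""
    names = [type(stmt).__name__]
    for child in ast.iter_child_nodes(stmt):
        if not isinstance(child, ast.stmt):
            names.extend(node_types(child))
    return names


def encode_statement(stmt, dim):
    """Toy encoder (assumption): sum a fixed pseudo-random vector per node type."""
    vec = [0.0] * dim
    for name in node_types(stmt):
        rng = random.Random(name)  # string seed, so the vector is deterministic
        for i in range(dim):
            vec[i] += rng.uniform(-1.0, 1.0)
    return vec


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(seq, heads):
    """Scaled dot-product self-attention per head, heads concatenated.

    Assumption: Q, K and V are plain slices of the input (no learned weights).
    """
    dim = len(seq[0])
    assert dim % heads == 0, "dim must be divisible by heads"
    d_k = dim // heads
    out = [[] for _ in seq]
    for h in range(heads):  # heads are independent, so they could run in parallel
        part = [v[h * d_k:(h + 1) * d_k] for v in seq]
        for i, q in enumerate(part):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in part]
            weights = softmax(scores)
            ctx = [sum(w * v[j] for w, v in zip(weights, part)) for j in range(d_k)]
            out[i].extend(ctx)
    return out


def represent(source, dim=8, heads=2):
    """Return one vector for a whole snippet (mean pooling is an assumption)."""
    stmts = statement_sequence(ast.parse(source))
    if not stmts:
        raise ValueError("source contains no statements")
    seq = [encode_statement(s, dim) for s in stmts]
    attended = multi_head_self_attention(seq, heads)
    return [sum(col) / len(attended) for col in zip(*attended)]


snippet = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
print([type(s).__name__ for s in statement_sequence(ast.parse(snippet))])
print(represent(snippet))
```

Because each head only reads its own slice of the statement vectors, the loop over `h` has no cross-head dependency, which is the property the summary credits for the parallel speed-up. A real model would replace the toy encoder and identity projections with trained parameters.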