Abstract: Code classification and code clone detection are crucial for understanding and maintaining large software systems. Although deep learning surpasses traditional techniques in capturing the features of source code, existing models suffer from low processing power and high complexity. We propose a novel source code representation method based on the multi-head attention mechanism (SCRMHA). SCRMHA captures the vector representation of entire code segments, enabling it to focus on different positions of the input sequence, capture richer semantic information, and simultaneously process different aspects and relationships of the sequence. Moreover, it can compute multiple attention heads in parallel, speeding up computation. We evaluate SCRMHA on both a standard dataset and an actual industrial dataset, and analyze the differences between the two. Experimental results on code classification and clone detection tasks show that SCRMHA consumes less time and reduces complexity by about one-third compared with traditional source code feature representation methods. The results demonstrate that SCRMHA reduces the computational complexity and time consumption of the model while maintaining accuracy.

What problem does this paper attempt to address?

This paper proposes SCRMHA (Source Code Representation based on Multi-Head Attention), a new source code representation method built on the multi-head attention mechanism. The work is motivated by the importance of code classification and code clone detection for understanding and maintaining large software systems. The authors observe that, although deep learning has surpassed traditional techniques at capturing source code features, existing models suffer from high computational complexity and limited parallel processing capability on large-scale datasets.

To address these issues, SCRMHA uses multi-head attention to process a code snippet in parallel, focusing on different positions of the input sequence, capturing richer semantic information, and handling multiple aspects and relationships of the sequence at once. Because the attention heads can be computed in parallel, computation is faster, which reduces time and complexity. SCRMHA first converts a code snippet into an abstract syntax tree (AST), then splits the AST into a sequence of small statement trees, encodes each statement tree into a vector with a statement encoder, and finally applies multi-head attention to extract features of the entire snippet.

Experimental results on code classification and clone detection show that, compared with traditional source code representation methods, SCRMHA is faster and reduces complexity by about one-third while maintaining accuracy. The paper also reviews related work, including methods based on text, token sequences, ASTs, program dependence graphs (PDGs), and control flow graphs (CFGs), as well as recent applications of neural networks to code representation. The authors compare SCRMHA with traditional methods on standard datasets and real-world industrial datasets and analyze the differences between the two.

In summary, the main contributions of the paper are:

1. A source code representation method, SCRMHA, based on multi-head attention, which significantly improves the training and inference speed of the model on large-scale datasets and reduces complexity.
2. An implementation of SCRMHA that makes effective use of parallel computing resources, overcoming the limited parallel processing capability of existing models.
3. An evaluation of SCRMHA on standard datasets and real-world industrial datasets, demonstrating its advantages in performance, time efficiency, and simplicity.
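The pipeline summarized above (AST, statement tree sequence, statement encoder, multi-head attention) can be illustrated with a small sketch. This is not the authors' implementation: it assumes Python source parsed with the standard `ast` module, replaces the learned statement encoder with a toy average of fixed pseudo-random vectors per node type, uses untrained random projection matrices, and adds mean pooling to obtain one vector per snippet. All function names (`statement_trees`, `encode_statement`, `multi_head_attention`, `represent`) are hypothetical.

```python
import ast
import math
import random


def statement_trees(source):
    """Parse source and return its statement nodes in source order."""
    tree = ast.parse(source)
    stmts = [n for n in ast.walk(tree) if isinstance(n, ast.stmt)]
    return sorted(stmts, key=lambda n: (n.lineno, n.col_offset))


def own_node_types(stmt):
    """Node type names inside a statement, excluding nested statements."""
    names = [type(stmt).__name__]
    stack = list(ast.iter_child_nodes(stmt))
    while stack:
        node = stack.pop()
        if isinstance(node, ast.stmt):
            continue  # nested statements are separate sequence elements
        names.append(type(node).__name__)
        stack.extend(ast.iter_child_nodes(node))
    return names


def encode_statement(stmt, dim):
    """Toy stand-in for a learned encoder: average of per-type vectors."""
    names = own_node_types(stmt)
    vec = [0.0] * dim
    for name in names:
        rng = random.Random(name)  # same node type -> same vector
        for i in range(dim):
            vec[i] += rng.uniform(-1.0, 1.0) / len(names)
    return vec


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Untrained projection weights (assumption: no learning here)."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, num_heads, rng):
    """Scaled dot-product self-attention with several independent heads."""
    dim = len(x[0])
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    head_dim = dim // num_heads
    heads = []
    for _ in range(num_heads):  # heads are independent of each other
        wq, wk, wv = (random_matrix(dim, head_dim, rng) for _ in range(3))
        q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
        k_t = [list(col) for col in zip(*k)]
        scores = matmul(q, k_t)
        weights = [softmax([s / math.sqrt(head_dim) for s in row])
                   for row in scores]
        heads.append(matmul(weights, v))
    # Concatenate head outputs per statement, then apply output projection.
    concat = [sum((h[i] for h in heads), []) for i in range(len(x))]
    return matmul(concat, random_matrix(dim, dim, rng))


def represent(source, dim=8, num_heads=2, seed=0):
    """Snippet vector: attention over statement vectors, then mean pooling."""
    x = [encode_statement(s, dim) for s in statement_trees(source)]
    attended = multi_head_attention(x, num_heads, random.Random(seed))
    return [sum(col) / len(col) for col in zip(*attended)]


snippet = "def add(a, b):\n    total = a + b\n    return total\n"
vector = represent(snippet)
print(len(statement_trees(snippet)), len(vector))  # prints "3 8"
```

The loop over heads runs sequentially here for readability. Since no head depends on another head's output, a tensor library could evaluate them as a single batched matrix operation, which is the source of the parallel speedup the summary describes.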

A Novel Source Code Representation Approach Based on Multi-Head Attention
