TransformCode: A Contrastive Learning Framework for Code Embedding Via Subtree Transformation

Zixiang Xian, Rubing Huang, Dave Towey, Chunrong Fang, Zhenyu Chen
DOI: https://doi.org/10.1109/tse.2024.3393419
IF: 7.4
2024-01-01
IEEE Transactions on Software Engineering
Abstract: Artificial intelligence (AI) has revolutionized software engineering (SE) by enhancing software development efficiency. The advent of pre-trained models (PTMs) leveraging transfer learning has significantly advanced AI for SE. However, existing PTMs that operate on individual code tokens suffer from several limitations: they are costly to train and fine-tune, and they rely heavily on labeled data for fine-tuning on task-specific datasets. In this paper, we present TransformCode, a novel framework that learns code embeddings in a contrastive learning manner. Our framework is encoder-agnostic and language-agnostic, which means that it can leverage any encoder model and handle any programming language. We also propose a novel data-augmentation technique called abstract syntax tree (AST) transformation, which applies syntactic and semantic transformations to the original code snippets, to generate more diverse and robust samples for contrastive learning. Our framework has several advantages over existing methods: (1) it is flexible and adaptable, because it can easily be extended to other downstream tasks that require code representation (such as code-clone detection and classification); (2) it is efficient and scalable, because it does not require a large model or a large amount of training data, and it can support any programming language; (3) it is not limited to unsupervised learning, but can also be applied to some supervised learning tasks by incorporating task-specific labels or objectives; and (4) it can also adjust the number of encoder parameters based on computing resources. We evaluate our framework on several code-related tasks, and demonstrate its effectiveness and superiority over state-of-the-art methods such as SourcererCC, Code2vec, and InferCode.
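To make the idea of AST transformation as data augmentation more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not list the specific transformations used, so variable renaming is chosen here purely as an illustrative assumption. Python's standard `ast` module is used for convenience, although the framework is described as language-agnostic. The names `RenameVariables`, `transform`, and `add_all` are hypothetical.

```python
import ast


class RenameVariables(ast.NodeTransformer):
    """Hypothetical transformation: rename arguments and assigned names.

    This is an assumed example of a behavior-preserving AST rewrite;
    it is not taken from the paper.
    """

    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        # Assign each distinct identifier a canonical name: var_0, var_1, ...
        if name not in self.mapping:
            self.mapping[name] = f"var_{len(self.mapping)}"
        return self.mapping[name]

    def visit_arg(self, node):
        # Function parameters are always renamed.
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            # Names being assigned to are renamed.
            node.id = self._fresh(node.id)
        elif node.id in self.mapping:
            # Reads are renamed only if the name was already mapped,
            # so builtins such as len or range are left untouched.
            node.id = self.mapping[node.id]
        return node


def transform(source):
    """Return a transformed view of `source` (requires Python 3.9+ for ast.unparse)."""
    tree = ast.parse(source)
    tree = RenameVariables().visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


source = """
def add_all(items):
    total = 0
    for item in items:
        total = total + item
    return total
"""

# The original snippet and its transformed view form one positive pair.
view = transform(source)
print(view)
```

In a contrastive setup of the kind the abstract describes, the original snippet and its transformed view would be treated as a positive pair whose embeddings the encoder is trained to bring together, while views of other snippets serve as negatives. This sketch handles only simple function bodies and ignores scoping subtleties such as nested functions and `global` declarations.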