Abstract: Word representation plays a key role in natural language processing (NLP). Various representation methods have been developed, among which pre-trained word embeddings (i.e., dense vectors that represent words) have been shown to be highly effective in many neural network-based NLP applications, such as named entity recognition (NER) and part-of-speech (POS) tagging. However, the use of pre-trained code embeddings for software engineering (SE) tasks has not been extensively explored. A recent study by Kang et al. (2019) finds that code embeddings may not be readily leveraged for downstream tasks that the embeddings were not trained for. However, Kang et al. (2019) evaluate only two code embedding approaches on three downstream tasks, and both approaches may not have taken full advantage of the context information in the code when training code embeddings. Given these limitations in the evaluated embedding techniques and downstream tasks, we revisit the prior study by examining whether the lack of generalizability of pre-trained code embeddings can be addressed by considering both the textual and structural information of the code and using unsupervised learning. To this end, we propose a framework, StrucTexVec, which uses a two-step unsupervised training strategy to incorporate the textual and structural information of the code. We then extend prior work (Kang et al. 2019) by evaluating seven code embedding techniques and comparing them with models that do not use pre-trained embeddings in six downstream tasks. Our results first confirm the findings from prior work, i.e., pre-trained embeddings may not always have a significant effect on the performance of downstream SE tasks.
Nevertheless, we also observe that (1) different embedding techniques can lead to noticeably different performance on some SE tasks; (2) using well pre-trained embeddings usually improves the performance of SE tasks (e.g., all six downstream tasks in our study); and (3) the structural context has a non-negligible impact on the quality of code embeddings (e.g., embedding approaches that leverage the structural context achieve the best performance in five out of six downstream tasks among all the evaluated non-contextual embeddings), so future work can consider incorporating such information into large pre-trained models. Our findings highlight the importance and effectiveness of combining textual and structural context when creating code embeddings. Moreover, code embedding techniques should be selected carefully for each downstream task, as it may be difficult to prescribe a single best-performing solution for all SE tasks.

TransformCode: A Contrastive Learning Framework for Code Embedding Via Subtree Transformation

A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities

CodeCSE: A Simple Multilingual Model for Code and Comment Sentence Embeddings

CodeRetriever: A Large Scale Contrastive Pre-Training Method for Code Search

PinNet: Pinpoint Instructive Information for Retrieval Augmented Code-to-Text Generation

SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation

TransCoder: Towards Unified Transferable Code Representation Learning Inspired by Human Skills

XCode: Towards Cross-Language Code Representation with Large-Scale Pre-Training

CodeRetriever: Unimodal and Bimodal Contrastive Learning for Code Search

A new approach for encoding code and assisting code understanding

StructCoder: Structure-Aware Transformer for Code Generation

SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings

Code Representation Learning At Scale

CodeSAM: Source Code Representation Learning by Infusing Self-Attention with Multi-Code-View Graphs

CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training

How to get better embeddings with code pre-trained models? An empirical study

Exploring Representation-Level Augmentation for Code Search

Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations

kTrans: Knowledge-Aware Transformer for Binary Code Embedding

Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?

Can pre-trained code embeddings improve model performance? Revisiting the use of code embeddings in software engineering tasks