Can pre-trained code embeddings improve model performance? Revisiting the use of code embeddings in software engineering tasks

Zishuo Ding, Heng Li, Weiyi Shang, Tse-Hsun Peter Chen
DOI: https://doi.org/10.1007/s10664-022-10118-5
IF: 3.762
2022-03-17
Empirical Software Engineering
Abstract: Word representation plays a key role in natural language processing (NLP). Various representation methods have been developed, among which pre-trained word embeddings (i.e., dense vectors that represent words) have been shown to be highly effective in many neural network-based NLP applications, such as named entity recognition (NER) and part-of-speech (POS) tagging. However, the use of pre-trained code embeddings for software engineering (SE) tasks has not been extensively explored. A recent study by Kang et al. (2019) finds that code embeddings may not be readily leveraged for the downstream tasks that the embeddings are not trained for. However, Kang et al. (2019) only evaluate two code embedding approaches on three downstream tasks and both approaches may not have taken full advantage of the context information in the code when training code embeddings. Considering the limitations of the evaluated embedding techniques and downstream tasks in Kang et al. (2019), we would like to revisit the prior study by examining whether the lack of generalizability of pre-trained code embeddings can be addressed by considering both the textual and structural information of the code and using unsupervised learning. Therefore, in this paper, we propose a framework, StrucTexVec, which uses a two-step unsupervised training strategy to incorporate the textual and structural information of the code. Then, we extend prior work (Kang et al. 2019) by evaluating seven code embedding techniques and comparing them with models that do not utilize pre-trained embeddings in six downstream tasks. Our results first confirm the findings from prior work, i.e., pre-trained embeddings may not always have a significant effect on the performance of downstream SE tasks.
Nevertheless, we also observe that (1) different embedding techniques can result in diverse performance for some SE tasks; (2) using well pre-trained embeddings usually improves the performance of SE tasks (e.g., all six downstream tasks in our study); and (3) the structural context has a non-negligible impact on improving the quality of code embeddings (e.g., embedding approaches that leverage the structural context achieve the best performance in five out of six downstream tasks among all the evaluated non-contextual embeddings), and thus, future work can consider incorporating such information into the large pre-trained models. Our findings imply the importance and effectiveness of combining both textual and structural context in creating code embeddings. Moreover, one should be very careful with the selection of code embedding techniques for different downstream tasks, as it may be difficult to prescribe a single best-performing solution for all SE tasks.
computer science, software engineering