Input Transformation for Pre-Trained-Model-Based Cross-Language Code Search

Mingyang Geng, Dezun Dong, Pingjing Lu
DOI: https://doi.org/10.1109/qrs-c60940.2023.00021
2023-01-01
Abstract: Cross-language code-to-code search has great potential to boost software development and software maintenance. However, the task is nontrivial because it requires accurately understanding the semantics of code written in different programming languages. To address this challenge, a natural idea is to leverage pretrained language models, which are trained on diverse languages and have shown great potential in producing high-quality representations of code across languages. The dominant way of using pretrained models is to feed code token sequences directly as input, owing to their Transformer-based architectures. Nonetheless, beyond lexical information, code snippets inherently contain rich semantic information that the token sequence may not adequately capture. To overcome this limitation, we propose an input transformation approach that, given a code snippet, generates a sequence carrying semantic information as the input to the pretrained model, which enables us to obtain effective representations of the code. Our key insight is that code snippets in different languages that implement identical functionality, although they may differ significantly in their token sequences or syntactic structures, can share certain similarities in their Program Dependency Graphs (PDGs). Therefore, instead of directly using the token sequence, we propose to first build a semantic graph that models the semantics of code in different languages based on data-flow and control-flow information, by optimizing the PDGs. A graph-to-sequence transformation module then produces the final transformed input. Finally, contrastive learning is used to fine-tune the model.
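The abstract does not specify which contrastive objective is used for fine-tuning. As a minimal sketch, assuming an InfoNCE-style loss with in-batch negatives (a common choice, not confirmed by the source), the idea can be illustrated as follows. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, positives, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives (assumed objective).

    queries[i] and positives[i] are embeddings of two snippets that implement
    the same functionality in different languages. Every positives[j] with
    j != i serves as a negative for queries[i].
    """
    n = len(queries)
    total = 0.0
    for i in range(n):
        logits = [cosine(queries[i], positives[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the true counterpart.
        total += log_denominator - logits[i]
    return total / n


# Toy 2-D embeddings: pairs that line up give a low loss, swapped pairs a high one.
aligned_loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss pulls the representations of functionally equivalent snippets together and pushes unrelated ones apart, which is the property a retrieval model needs.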
Our large-scale evaluation shows that the method is effective: it consistently outperforms the state-of-the-art C4 approach by at least 6% in Mean Reciprocal Rank (MRR) under six different settings.
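For readers unfamiliar with the metric, MRR averages, over all queries, the reciprocal of the rank at which the correct result first appears. The sketch below uses the standard definition; the function name and the toy data are illustrative only and are not taken from the paper's evaluation.

```python
def mean_reciprocal_rank(ranked_results, targets):
    """Compute MRR.

    ranked_results: one ranked list of candidate ids per query, best first.
    targets: the correct candidate id for each query.
    A query whose target is missing from its list contributes 0.
    """
    total = 0.0
    for candidates, target in zip(ranked_results, targets):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == target:
                total += 1.0 / rank
                break
    return total / len(ranked_results)


# Three toy queries: target found at rank 1, at rank 2, and not found at all.
mrr = mean_reciprocal_rank(
    [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]],
    ["a", "y", "r"],
)
# (1 + 1/2 + 0) / 3 = 0.5
```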