Abstract: Cross-language code-to-code search has great potential to boost software development and software maintenance. However, the task is nontrivial because it requires accurately understanding the semantics of code written in different programming languages. To address this challenge, a natural idea is to leverage pretrained language models, which are trained on diverse languages and have shown great potential in producing high-quality representations for code across languages. The dominant way of using pretrained models is to feed code token sequences directly as input, owing to their Transformer-based architectures. Beyond lexical information, however, code snippets inherently contain rich semantic information that the token sequence may not adequately capture. To overcome this limitation, we propose an input transformation approach that, given a code snippet, generates a sequence carrying semantic information as the input to the pretrained model, which enables us to effectively obtain representations of the code. Our key insight is that code snippets in different languages that implement identical functionality, although they may differ significantly in their token sequences or syntactic structures, can share certain similarities in their Program Dependency Graphs (PDGs). Therefore, instead of directly using the token sequence, we first build a semantic graph that models the semantics of code in different languages based on data flow and control flow information, obtained by optimizing the PDGs. A graph-to-sequence transformation module then converts this graph into the final input sequence. Finally, contrastive learning is used to fine-tune the model.
Our large-scale evaluation shows that the method is effective: it consistently outperforms the state-of-the-art C4 approach by at least 6% in Mean Reciprocal Rank (MRR) under six different settings.
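For readers unfamiliar with the metric, MRR averages, over all queries, the reciprocal of the rank at which the first correct result appears. The sketch below is a minimal illustration of the standard definition, not the evaluation code used in this work; the function name `mean_reciprocal_rank`, its parameters, and the convention of scoring a query as 0 when the target is not retrieved are assumptions made for the example.

```python
from typing import Sequence


def mean_reciprocal_rank(
    ranked_results: Sequence[Sequence[str]],
    targets: Sequence[str],
) -> float:
    """Compute MRR for a set of queries.

    ranked_results[i] is the ranked candidate list returned for query i
    (best match first); targets[i] is the single correct candidate for
    that query. Both names are hypothetical, chosen for illustration.
    """
    if len(ranked_results) != len(targets):
        raise ValueError("one target is required per query")
    if not ranked_results:
        return 0.0

    total = 0.0
    for candidates, target in zip(ranked_results, targets):
        # Ranks are 1-based; a query whose target is never retrieved
        # contributes 0 (an assumed convention).
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == target:
                total += 1.0 / rank
                break
    return total / len(ranked_results)
```

As a usage example, with three queries whose correct snippets are ranked first, second, and not retrieved at all, `mean_reciprocal_rank([["a", "b", "c"], ["x", "y", "z"], ["p", "q"]], ["a", "y", "missing"])` evaluates to (1 + 1/2 + 0) / 3 = 0.5.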

Code Execution with Pre-trained Language Models

Large Language Models as Code Executors: An Exploratory Study

What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code

What do pre-trained code models know about code?

Natural Language to Code Translation with Execution

A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities

To Code, or Not To Code? Exploring Impact of Code in Pre-training

Towards Understanding What Code Language Models Learned

INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers

Code Representation Pre-training with Complements from Program Executions

CodeTrans: Towards Cracking the Language of Silicon's Code Through Self-Supervised Deep Learning and High Performance Computing

Unveiling Code Pre-Trained Models: Investigating Syntax and Semantics Capacities

Input Transformation for Pre-Trained-Model-Based Cross-Language Code Search

Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code

Understanding Code Semantics: An Evaluation of Transformer Models in Summarization

LExecutor: Learning-Guided Execution

Better Language Models of Code through Self-Improvement

Bridge and Hint: Extending Pre-trained Language Models for Long-Range Code

Represent Code As Action Sequence for Predicting Next Method Call