Abstract: Cross-language code-to-code search has great potential to boost software development and software maintenance. However, performing this task is nontrivial, since it requires accurately understanding the semantics of code written in different programming languages. To address this challenge, a natural idea is to leverage pretrained language models, which are trained on diverse languages and have shown great potential in producing high-quality representations for code across different languages. The dominant way of utilizing pretrained models is to directly use code token sequences as the inputs, due to their Transformer-based architectures. Nonetheless, beyond lexical information, code snippets inherently contain rich semantic information, which may not be adequately captured by the token sequence. To overcome this limitation, we propose an input transformation approach that, given a code snippet, generates a sequence carrying semantic information as the input to the pretrained model, which enables us to effectively obtain representations of the code. Our key insight is that code snippets in different languages that implement identical functionality, although they may differ significantly in their token sequences or syntactic structures, may share certain similarities in their Program Dependency Graphs (PDGs). Therefore, instead of directly using the token sequence, we propose to first build a semantic graph that models the semantics of code in different languages, based on data flow and control flow information, by optimizing the PDGs. A graph-to-sequence transformation module then converts this graph into the final input sequence. Finally, contrastive learning is used to fine-tune the model.
Our large-scale evaluation shows that the method is effective: it consistently outperforms the state-of-the-art C4 approach by at least 6% in Mean Reciprocal Rank (MRR) under six different settings.
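
For readers unfamiliar with the metric, MRR averages, over all queries, the reciprocal of the rank at which the first correct result appears. The sketch below shows only that standard definition. The function name `mean_reciprocal_rank` and the convention that a rank of 0 means "no correct result retrieved" are assumptions made for illustration, not details of the paper's evaluation code.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Compute MRR from 1-based ranks of the first correct result per query.

    Assumption for this sketch: a rank of 0 marks a query whose correct
    result was not retrieved, and it contributes 0 to the average.
    """
    if not ranks:
        raise ValueError("ranks must contain at least one query")
    reciprocal_sum = 0.0
    for rank in ranks:
        if rank < 0:
            raise ValueError("ranks must be non-negative")
        # A correct hit at position r contributes 1/r; a miss contributes 0.
        reciprocal_sum += 1.0 / rank if rank > 0 else 0.0
    return reciprocal_sum / len(ranks)


if __name__ == "__main__":
    # Three queries whose first correct results sit at ranks 1, 2, and 4.
    print(round(mean_reciprocal_rank([1, 2, 4]), 4))
```

With ranks 1, 2, and 4, the reciprocal ranks are 1, 0.5, and 0.25, so the MRR is 1.75 / 3.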

Input Transformation for Pre-Trained-Model-Based Cross-Language Code Search
