Abstract: Recent advances in deep learning have significantly improved source code understanding by leveraging large amounts of open-source software data. Because they can draw on more data, code representation models trained on multilingual datasets outperform monolingual models and have attracted growing attention. However, mixing source code from many programming languages makes it hard for multilingual models to distinguish language-specific textual semantics and syntactic structures, which makes learning directly from multilingual datasets considerably more difficult. In addition, developers tend to choose similar identifiers for a given problem, even when coding in different languages. Yet similar identifiers in multilingual code snippets do not imply that the snippets implement the same functionality, so models may be misled into overemphasizing these unreliable signals and ignoring the semantic information in multilingual code. To tackle these issues, we propose LAMCode, a language-aware multilingual code understanding model. Specifically, we propose a simple yet effective method for perceiving linguistic information by injecting a language-specific viewer into the language model. Furthermore, we introduce a cross-lingual contrastive learning method that generates training instances that are more similar but share fewer overlapping features, which prevents the model from over-relying on similar identifiers across languages. We conduct extensive experiments on a large-scale multilingual dataset to evaluate the effectiveness of our approach. The experimental results show that it significantly outperforms state-of-the-art methods.
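
The abstract does not specify how training instances with fewer overlapping features are generated, nor which contrastive objective is used. As a rough illustration only, the sketch below assumes that identifier overlap is reduced by renaming identifiers to placeholders and that the objective is a standard InfoNCE loss over cosine similarities. The names `rename_identifiers`, `info_nce`, and `KEYWORDS` are hypothetical and are not taken from LAMCode.

```python
import math
import re

# Hypothetical keyword set for illustration; a real system would rely on a
# per-language lexer instead of a hand-written list.
KEYWORDS = {"def", "return", "for", "in", "if", "else", "while",
            "int", "public", "static"}


def rename_identifiers(code: str) -> str:
    """Replace each non-keyword identifier with a placeholder (VAR_0, VAR_1, ...).

    Assumption: this stands in for whatever augmentation produces training
    instances with the same behavior but fewer shared identifiers.
    """
    mapping = {}

    def substitute(match):
        name = match.group(0)
        if name in KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"VAR_{len(mapping)}"
        return mapping[name]

    return re.sub(r"[A-Za-z_][A-Za-z_0-9]*", substitute, code)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor embedding, one positive, and a list of negatives."""

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    # The positive pair is placed at index 0 of the logits.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]

    # Numerically stable log-sum-exp for the denominator.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]
```

Under these assumptions, a snippet and its renamed counterpart (or a functionally equivalent snippet in another language) would form a positive pair, so the loss can only be lowered by matching structure and semantics rather than shared identifier names.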
