IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators

Indraneil Paul, Goran Glavaš, Iryna Gurevych
2024-04-16
Abstract: Code understanding and generation have fast become some of the most popular applications of language models (LMs). Nonetheless, research on multilingual aspects of Code-LMs (i.e., LMs for code generation) such as cross-lingual transfer between different programming languages, language-specific data augmentation, and post-hoc LM adaptation, alongside exploitation of data sources other than the original textual content, has been much sparser than for their natural language counterparts. In particular, most mainstream Code-LMs have been pre-trained on source code files alone. In this work, we investigate the prospect of leveraging readily available compiler intermediate representations (IR), which are shared across programming languages, to improve the multilingual capabilities of Code-LMs and facilitate cross-lingual transfer.
Artificial Intelligence, Computation and Language, Programming Languages
What problem does this paper attempt to address?