Abstract: While there has been a recent burgeoning of applications at the intersection of natural and programming languages, such as code generation and code summarization, these applications are usually English-centric. This creates a barrier for program developers who are not proficient in English. To mitigate this gap in technology development across languages, we propose a multilingual dataset, MCoNaLa, to benchmark code generation from natural language commands extending beyond English. Modeled off of the methodology from the English Code/Natural Language Challenge (CoNaLa) dataset, we annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian. We present a quantitative evaluation of performance on the MCoNaLa dataset by testing with state-of-the-art code generation systems. While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts, revealing the challenges in adapting code generation to new languages.

What problem does this paper attempt to address?

The paper addresses the fact that benchmark datasets and research for natural-language-to-code generation focus mainly on English, which creates an obstacle for programmers who are not proficient in English. To narrow this gap in technology development across languages, the authors propose MCoNaLa, a multilingual dataset for benchmarking code generation from natural-language commands in languages other than English. Specifically, MCoNaLa follows the methodology of the existing English Code/Natural Language Challenge (CoNaLa) dataset and annotates a total of 896 NL-code pairs covering Spanish, Japanese, and Russian. With this dataset, the authors evaluate the performance of existing code-generation systems in a multilingual setting and reveal the challenges of adapting code generation to new languages.

### Main contributions of the paper:

1. **Constructing a multilingual dataset**: The MCoNaLa dataset contains natural-language commands in Spanish, Japanese, and Russian together with their corresponding Python code snippets, addressing a gap in multilingual code generation.
2. **Systematic evaluation**: The authors systematically evaluate state-of-the-art code-generation systems on MCoNaLa and find that their performance on non-English languages is significantly lower than on English.
3. **Challenges of cross-language transfer**: The paper demonstrates the difficulties of cross-language natural-language-to-code generation and emphasizes the importance of developing comprehensive language-processing methods.

### Structure of the paper:

- **Introduction**: Introduces the background of code-intelligence applications, especially the current state and challenges of the natural-language-to-code generation task.
- **MCoNaLa dataset**: Describes in detail how the dataset was constructed, including the selection of language sources, the identification of valid posts, and the annotation of parallel samples.
- **Methods**: Introduces three training and testing settings (translate-train, translate-test, zero-shot) and three baseline models (mBART, TRANX, TAE).
- **Experiments**: Reports the experimental results and analyzes the performance differences across languages and settings.
- **Analysis**: Explores the differences between languages and the impact of automatic translation quality on the experimental results.
- **Conclusion**: Summarizes the main findings and points out directions for future work.

### Main findings:

- **Performance differences**: Across all languages and experimental settings, the highest average BLEU score is only 7.28, far lower than the 33.41 obtained for English, indicating that English has a higher similarity to Python while other languages face greater challenges.
- **Model performance**: Models with task-specific modules and training (such as TRANX and TAE) perform better in most cases, but the performance differences between languages remain significant.
- **Translation quality**: The quality of automatic translation has an important impact on the experimental results, especially for key verbs, where errors or omissions may cause misalignment between intents and code snippets.

Through these contributions, the paper provides a multilingual benchmark dataset as well as a reference point for future research.
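The three training and testing settings named above (translate-train, translate-test, zero-shot) can be pictured as different ways of pairing training and test data. The sketch below is only an illustration of that data flow, not the authors' implementation. It assumes the usual reading of these terms: translate-train translates English training intents into the target language, translate-test translates target-language test intents into English, and zero-shot trains on English and tests on the target language directly. The names `EvalPlan`, `build_plan`, and `toy_translate` are hypothetical, and the toy translator merely tags the text where a real machine translation system would be used.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (natural-language intent, code snippet)
Translator = Callable[[str, str, str], str]  # (text, src_lang, tgt_lang) -> text


@dataclass
class EvalPlan:
    """Data a model would be trained on and tested on in one setting."""
    setting: str
    train_pairs: List[Pair]
    test_pairs: List[Pair]


def build_plan(
    setting: str,
    english_train: List[Pair],
    target_test: List[Pair],
    target_lang: str,
    translate: Translator,
) -> EvalPlan:
    """Assemble train/test data for one of the three settings (assumed semantics)."""
    if setting == "translate-train":
        # Translate English training intents into the target language,
        # then test on the original target-language intents.
        train = [(translate(nl, "en", target_lang), code) for nl, code in english_train]
        return EvalPlan(setting, train, list(target_test))
    if setting == "translate-test":
        # Train on English as-is; translate test intents into English.
        test = [(translate(nl, target_lang, "en"), code) for nl, code in target_test]
        return EvalPlan(setting, list(english_train), test)
    if setting == "zero-shot":
        # Train on English, test on the target language with no translation.
        return EvalPlan(setting, list(english_train), list(target_test))
    raise ValueError(f"unknown setting: {setting}")


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Placeholder translator: only marks the direction, does not translate."""
    return f"[{src}->{tgt}] {text}"


if __name__ == "__main__":
    english_train = [("sort list x", "x.sort()")]
    spanish_test = [("ordenar la lista x", "x.sort()")]
    for name in ("translate-train", "translate-test", "zero-shot"):
        print(build_plan(name, english_train, spanish_test, "es", toy_translate))
```

In all three cases the code side of each pair is left untouched; only the natural-language side changes. This is why translation quality, noted in the findings above, can directly affect how well intents align with their code snippets.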

MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
