MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages

Zhiruo Wang, Grace Cuenca, Shuyan Zhou, Frank F. Xu, Graham Neubig
DOI: https://doi.org/10.48550/arXiv.2203.08388
2023-02-07
Abstract: While there has been a recent burgeoning of applications at the intersection of natural and programming languages, such as code generation and code summarization, these applications are usually English-centric. This creates a barrier for program developers who are not proficient in English. To mitigate this gap in technology development across languages, we propose a multilingual dataset, MCoNaLa, to benchmark code generation from natural language commands extending beyond English. Modeled off of the methodology from the English Code/Natural Language Challenge (CoNaLa) dataset, we annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian. We present a quantitative evaluation of performance on the MCoNaLa dataset by testing with state-of-the-art code generation systems. While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts, revealing the challenges in adapting code generation to new languages.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the fact that existing benchmark datasets and research for natural-language-to-code generation focus mainly on English, which is an obstacle for programmers who are not proficient in English. To narrow this gap between languages, the authors propose MCoNaLa, a multilingual dataset for benchmarking code generation from natural-language commands in several languages. MCoNaLa extends the methodology of the English Code/Natural Language Challenge (CoNaLa) dataset and annotates a total of 896 NL-code pairs covering Spanish, Japanese, and Russian. With this dataset, the authors evaluate existing code-generation systems in a multilingual setting and show the challenges of adapting code generation to new languages.

### Main contributions of the paper:

1. **Constructing a multilingual dataset**: MCoNaLa contains natural-language commands in Spanish, Japanese, and Russian paired with their corresponding Python code snippets, filling a gap in multilingual code generation.
2. **Systematic evaluation**: The authors systematically evaluate state-of-the-art code-generation systems on MCoNaLa across the three languages and find that performance on non-English languages is significantly lower than on English.
3. **Challenges of cross-language transfer**: The paper demonstrates the difficulty of cross-language natural-language-to-code generation and emphasizes the importance of developing language-comprehensive methods.

### Structure of the paper:

- **Introduction**: Introduces the background of code-intelligence applications, in particular the current state and challenges of natural-language-to-code generation.
- **MCoNaLa dataset**: Describes how the dataset was constructed, including the selection of language sources, the identification of valid posts, and the annotation of parallel samples.
- **Methods**: Introduces three training and testing settings (translate-train, translate-test, zero-shot) and three baseline models (MBART, TRANX, TAE).
- **Experiments**: Reports the experimental results and analyzes the performance differences across languages and settings.
- **Analysis**: Explores the differences between languages and the impact of automatic translation quality on the results.
- **Conclusion**: Summarizes the main findings and points out directions for future work.

### Main findings:

- **Performance differences**: Across all languages and experimental settings, the highest average BLEU score is only 7.28, far below the 33.41 reported for English. This suggests that English intents align more closely with Python, while the other languages pose greater challenges.
- **Model performance**: Models with task-specific modules and training (such as TRANX and TAE) perform better in most cases, but the performance differences between languages remain significant.
- **Translation quality**: The quality of automatic translation has a substantial impact on the results. Errors or omissions in the translation of key verbs, in particular, can misalign intents with their code snippets.

With these contributions, the paper provides a multilingual benchmark dataset and a reference point for future research on code generation beyond English.
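
The three training and testing settings named under Methods (translate-train, translate-test, zero-shot) differ mainly in which side of the data is machine-translated. The sketch below illustrates that difference only; it is not the authors' code. The `Plan` container, the function names, and the `translate(text, src, tgt)` callable are hypothetical stand-ins, and it assumes the usual reading of these settings: training data comes from English NL-code pairs, and test intents are in the target language.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                      # (natural-language intent, code snippet)
Translate = Callable[[str, str, str], str]  # hypothetical: (text, src_lang, tgt_lang) -> text


@dataclass
class Plan:
    """What a model is trained on and what it is asked at test time."""
    train_pairs: List[Pair]
    test_intents: List[str]


def translate_train(en_train: List[Pair], test_intents: List[str],
                    lang: str, translate: Translate) -> Plan:
    # Translate the English training intents into the target language;
    # test intents stay in the target language.
    train = [(translate(nl, "en", lang), code) for nl, code in en_train]
    return Plan(train, list(test_intents))


def translate_test(en_train: List[Pair], test_intents: List[str],
                   lang: str, translate: Translate) -> Plan:
    # Train on the original English pairs; translate the test intents to English.
    test = [translate(nl, lang, "en") for nl in test_intents]
    return Plan(list(en_train), test)


def zero_shot(en_train: List[Pair], test_intents: List[str]) -> Plan:
    # Train on English, test directly on the target language with no translation.
    # This only makes sense for a model with a multilingual encoder.
    return Plan(list(en_train), list(test_intents))


def toy_translate(text: str, src: str, tgt: str) -> str:
    # Placeholder for a real machine-translation system: it only tags the text.
    return f"[{src}->{tgt}] {text}"


en_train = [("sort list x", "x.sort()")]
es_test = ["ordenar la lista x"]

for plan in (translate_train(en_train, es_test, "es", toy_translate),
             translate_test(en_train, es_test, "es", toy_translate),
             zero_shot(en_train, es_test)):
    print(plan)
```

Because translate-train and translate-test both pass intents through a translation step, any error made there (for example a dropped or mistranslated key verb) reaches the model directly, which is consistent with the translation-quality finding above.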