Exploring Multi-Lingual Bias of Large Code Models in Code Generation

Chaozheng Wang, Zongjie Li, Cuiyun Gao, Wenxuan Wang, Ting Peng, Hailiang Huang, Yuetang Deng, Shuai Wang, Michael R. Lyu
2024-04-30
Abstract: Code generation aims to synthesize code that fulfills functional requirements given natural language (NL) specifications, which can greatly improve development efficiency. In the era of large language models (LLMs), large code models (LCMs) have recently been proposed to generate source code. LCMs can generate highly feasible solutions for programming problems described in natural language. Despite their effectiveness, we observe a noticeable multilingual bias in the generation performance of LCMs. Specifically, LCMs are proficient at generating solutions when given instructions in English, yet may falter when faced with semantically equivalent instructions in other NLs such as Chinese. Moreover, the ability of LCMs to generate code varies across programming languages (PLs), such as Python and C++. These observations indicate a multilingual bias in the generative capabilities of LCMs, which has remained unexplored.
Software Engineering
What problem does this paper attempt to address?
This paper investigates multilingual bias in large code models (LCMs) on code generation tasks. The researchers observed that code generation performance differs significantly depending on the natural language of the instruction (for example, English versus Chinese). Moreover, even for instructions in the same natural language, performance varies across target programming languages (e.g., Python, C++, and Java).

To evaluate this bias systematically, the research team constructed the first multilingual evaluation benchmark, X-HumanEval-X, and ran large-scale experiments on nine popular LCMs. The results show that with Chinese instructions, code generation capability decreases by at least 13% compared to English instructions, and the performance gap between programming languages can reach 23.7%.

The researchers then explored mitigating the bias through prompting, comparing several strategies for translating Chinese instructions into English. Using third-party translation tools for single-step or multi-step translation effectively reduces the bias, whereas self-translation may cause significant performance degradation or even exacerbate the bias. Finally, the researchers also mitigated the bias through instruction tuning, which improved code generation performance, especially when the training data covered a greater diversity of natural languages and programming languages.

In summary, this paper is the first to investigate multilingual bias in LCMs for code generation in depth, and it proposes effective mitigation strategies.
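A performance gap such as the ones quoted above can be expressed either in absolute percentage points or as a relative drop, and the summary does not say which convention the paper uses. The following is a minimal sketch that computes both. The function name `bias_gap` and the pass rates are hypothetical illustrations, not the paper's method or data.

```python
def bias_gap(baseline_rate: float, other_rate: float) -> tuple[float, float]:
    """Compare two pass rates given as fractions in [0, 1].

    Returns (absolute gap in percentage points, relative drop in percent),
    both measured against the baseline rate.
    """
    if not 0.0 < baseline_rate <= 1.0:
        raise ValueError("baseline_rate must be in (0, 1]")
    if not 0.0 <= other_rate <= 1.0:
        raise ValueError("other_rate must be in [0, 1]")
    difference = baseline_rate - other_rate
    absolute_points = difference * 100.0
    relative_percent = difference / baseline_rate * 100.0
    return absolute_points, relative_percent


# Hypothetical pass rates for the same model and problems,
# prompted in English and in Chinese (NOT values from the paper).
english_rate = 0.50
chinese_rate = 0.40

absolute_points, relative_percent = bias_gap(english_rate, chinese_rate)
print(f"absolute gap: {absolute_points:.1f} points")
print(f"relative drop: {relative_percent:.1f}%")
```

With these made-up inputs the two conventions differ by a factor of two, which is why a reported gap is only comparable across studies once the convention is known.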