Exploring Multi-Lingual Bias of Large Code Models in Code Generation

Chaozheng Wang, Zongjie Li, Cuiyun Gao, Wenxuan Wang, Ting Peng, Hailiang Huang, Yuetang Deng, Shuai Wang, Michael R. Lyu

2024-04-30

Abstract: Code generation aims to synthesize code that fulfills functional requirements given natural language (NL) specifications, which can greatly improve development efficiency. In the era of large language models (LLMs), large code models (LCMs) have recently been proposed to generate source code. LCMs can generate highly feasible solutions for programming problems described in natural language. Despite this effectiveness, we observe a noticeable multilingual bias in the generation performance of LCMs. Specifically, LCMs are proficient at generating solutions when given instructions in English, yet may falter when faced with semantically equivalent instructions in other NLs such as Chinese. Moreover, the ability of LCMs to generate code varies across programming languages (PLs), such as Python and C++. These observations indicate the presence of multi-lingual bias in the generative capabilities of LCMs, which has remained unexplored.

Software Engineering

What problem does this paper attempt to address?

This paper investigates multilingual bias in large code models (LCMs) for code generation. The researchers observe that code generation performance differs significantly depending on the natural language of the instruction (for example, English versus Chinese). Moreover, even for instructions in the same natural language, performance varies across target programming languages (e.g., Python, C++, and Java).

To evaluate this bias systematically, the research team constructed the first multilingual evaluation benchmark, X-HumanEval-X, and ran large-scale experiments on nine popular LCMs. The results show that code generation capability decreases by at least 13% with Chinese instructions compared to English instructions, and that the performance gap between programming languages can be as high as 23.7%.

The researchers then explored two ways to mitigate the bias. The first is prompting, including different strategies for translating Chinese instructions into English before generation. Single-step or multi-step translation with a third-party translation tool effectively reduces the bias, while self-translation may significantly degrade performance or even exacerbate the bias. The second is instruction tuning, which improved the code generation performance of LCMs, especially when the training data covered a more diverse set of natural languages and programming languages.

In summary, this paper is the first to investigate multilingual bias in LCMs for code generation in depth, and it proposes effective mitigation strategies.
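To make the two reported quantities concrete, here is a minimal sketch of how a natural-language drop and a programming-language gap could be computed from per-setting pass rates. Everything in it is an assumption for illustration: the `PASS_RATES` values are made up and are not results from the paper, the function names are hypothetical, and the summary does not say whether the reported figures are relative or absolute differences. The sketch uses a relative drop for natural languages and an absolute gap in percentage points for programming languages.

```python
from typing import Dict, Tuple

# Hypothetical pass rates (fraction of problems solved), keyed by (NL, PL).
# These numbers are invented for illustration, not taken from the paper.
PASS_RATES: Dict[Tuple[str, str], float] = {
    ("en", "python"): 0.60,
    ("en", "cpp"): 0.50,
    ("zh", "python"): 0.48,
    ("zh", "cpp"): 0.42,
}


def relative_drop(baseline: float, other: float) -> float:
    """Relative decrease of `other` with respect to `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - other) / baseline * 100.0


def nl_bias(rates: Dict[Tuple[str, str], float], pl: str,
            base_nl: str = "en", other_nl: str = "zh") -> float:
    """Relative drop when switching the instruction language for one PL."""
    return relative_drop(rates[(base_nl, pl)], rates[(other_nl, pl)])


def pl_gap(rates: Dict[Tuple[str, str], float], nl: str) -> float:
    """Gap in percentage points between the best and worst PL for one NL."""
    scores = [rate for (lang, _), rate in rates.items() if lang == nl]
    if not scores:
        raise KeyError(f"no pass rates recorded for natural language {nl!r}")
    return (max(scores) - min(scores)) * 100.0


if __name__ == "__main__":
    print(f"NL drop (python, en -> zh): {nl_bias(PASS_RATES, 'python'):.1f}%")
    print(f"PL gap (en): {pl_gap(PASS_RATES, 'en'):.1f} points")
```

Keeping the two measures separate mirrors the summary, which treats bias across instruction languages and bias across target programming languages as distinct effects.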

Bias Assessment and Mitigation in LLM-based Code Generation

Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach

Bias Testing and Mitigation in LLM-based Code Generation

Improving Natural Language Capability of Code Large Language Model

From Effectiveness to Efficiency: Comparative Evaluation of Code Generated by LCGMs for Bilingual Programming Questions

Where Do Large Language Models Fail When Generating Code?

On the Effectiveness of Large Language Models in Domain-Specific Code Generation

A Survey on Large Language Models for Code Generation

A Survey on Evaluating Large Language Models in Code Generation Tasks

Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?

Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes

Evaluating Large Language Models in Class-Level Code Generation

Multi-Programming Language Ensemble for Code Generation in Large Language Model

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Personality-Guided Code Generation Using Large Language Models

Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models

Can Large Language Models Generate Geospatial Code?

Mitigating Gender Bias in Code Large Language Models via Model Editing