Bias Testing and Mitigation in LLM-based Code Generation

Dong Huang, Qingwen Bu, Jie Zhang, Xiaofei Xie, Junjie Chen, Heming Cui
2024-05-24
Abstract: Utilizing state-of-the-art Large Language Models (LLMs), automatic code generation models play a pivotal role in enhancing the productivity of software development procedures. As the adoption of LLMs becomes more widespread in software coding ecosystems, a pressing issue has emerged: does the generated code contain social bias and unfairness, such as those related to age, gender, and race? This issue concerns the integrity, fairness, and ethical foundation of software applications that depend on the code generated by these models, yet is under-explored in the literature. This paper presents a novel bias testing framework that is specifically designed for code generation tasks. Based on this framework, we conduct an extensive evaluation of the bias in code generated by five state-of-the-art LLMs. Our findings reveal that 20.29% to 44.93% of the code functions generated by the models under study are biased when handling bias-sensitive tasks (i.e., tasks that involve sensitive attributes such as age and gender). This indicates that the existing LLMs can be unfair in code generation, posing risks of unintended and harmful software behaviors. To mitigate bias for code generation models, we evaluate five bias mitigation prompt strategies, i.e., utilizing bias testing results to refine the code (zero-shot), one-shot, few-shot, and two Chain-of-Thought (CoT) prompts. Our evaluation results illustrate that these strategies are all effective in mitigating bias. Overall, one-shot and few-shot learning are the two most effective. For GPT-4, 80% to 90% of code bias can be removed with one-shot learning.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper primarily focuses on the issues of social bias and social unfairness present when large language models (LLMs), widely adopted in the software coding ecosystem, generate code. Specifically, it addresses biases related to age, gender, and race. The paper attempts to solve the following key problems:

1. **Do LLMs generate biased code when handling sensitive tasks?**
   - Investigate whether LLMs exhibit biases towards specific attributes (such as gender, age, etc.) when generating code.
2. **Is the designed bias testing method reliable in identifying biases in code?**
   - Validate whether the proposed bias testing framework can effectively detect biases in code.
3. **How effective is prompt engineering in mitigating biases in code generation?**
   - Explore the effectiveness of various prompt engineering strategies (zero-shot, one-shot, few-shot learning, and chain-of-thought) in reducing or eliminating biases in generated code.

The paper proposes a novel bias testing framework specifically for code generation tasks and uses this framework to conduct a thorough evaluation of five state-of-the-art LLMs, finding that biases are prevalent. Additionally, the study explores common bias mitigation prompt strategies and finds that while directly using these strategies has limited effectiveness, combining them with test feedback can significantly reduce the proportion of biases.
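
To make the idea of testing generated code for bias more concrete, here is a minimal Python sketch. It is not the paper's framework, only an illustration under one assumption: a function is treated as biased with respect to a sensitive attribute if changing that attribute alone, with every other input held fixed, changes the output. The function `assess_loan_eligibility`, the helper `find_biased_attributes`, and all attribute names and values are hypothetical.

```python
from itertools import product
from typing import Any, Callable, Dict, List


def assess_loan_eligibility(applicant: Dict[str, Any]) -> bool:
    """Hypothetical stand-in for a model-generated function."""
    if applicant["income"] < 30000:
        return False
    # Biased branch: the decision depends on a sensitive attribute.
    if applicant["gender"] == "female" and applicant["income"] < 50000:
        return False
    return True


def find_biased_attributes(
    func: Callable[[Dict[str, Any]], Any],
    sensitive_values: Dict[str, List[Any]],
    neutral_values: Dict[str, List[Any]],
) -> List[str]:
    """Return the sensitive attributes whose value alone changes func's output.

    Assumption of this sketch: outputs are hashable and func is deterministic.
    """
    all_values = {**neutral_values, **sensitive_values}
    names = list(all_values)
    # Enumerate every combination of attribute values as a base input.
    base_inputs = [dict(zip(names, combo)) for combo in product(*all_values.values())]

    biased = []
    for attr, values in sensitive_values.items():
        for base in base_inputs:
            # Vary only `attr`; everything else stays as in `base`.
            outputs = {func({**base, attr: value}) for value in values}
            if len(outputs) > 1:
                biased.append(attr)
                break
    return biased


sensitive = {"gender": ["male", "female"], "age": [25, 60]}
neutral = {"income": [20000, 40000, 80000]}
print(find_biased_attributes(assess_loan_eligibility, sensitive, neutral))
```

In this toy case only `gender` is flagged, because `age` never influences the result. A list of flagged attributes like this is one plausible form of the test feedback that could be passed back to the model in a mitigation prompt.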