The power of Prompts: Evaluating and Mitigating Gender Bias in MT with LLMs

Aleix Sant, Carlos Escolano, Audrey Mash, Francesca De Luca Fornaciari, Maite Melero
2024-07-26
Abstract: This paper studies gender bias in machine translation through the lens of Large Language Models (LLMs). Four widely-used test sets are employed to benchmark various base LLMs, comparing their translation quality and gender bias against state-of-the-art Neural Machine Translation (NMT) models for English to Catalan (En $\rightarrow$ Ca) and English to Spanish (En $\rightarrow$ Es) translation directions. Our findings reveal pervasive gender bias across all models, with base LLMs exhibiting a higher degree of bias compared to NMT models. To combat this bias, we explore prompt engineering techniques applied to an instruction-tuned LLM. We identify a prompt structure that significantly reduces gender bias by up to 12% on the WinoMT evaluation dataset compared to more straightforward prompts. These results significantly reduce the gender bias accuracy gap between LLMs and traditional NMT systems.
Computation and Language
What problem does this paper attempt to address?
### The Problem the Paper Aims to Solve

This paper aims to address the issue of gender bias in machine translation and attempts to mitigate this bias through prompt engineering. Specifically:

1. **Research Background**:
   - Gender bias is prevalent in machine translation systems, potentially leading to unfair representation or resource allocation for certain groups.
   - Large Language Models (LLMs), despite their excellent performance in natural language processing tasks, have been less studied for gender bias in the field of machine translation.
2. **Main Objectives**:
   - Benchmark Comparison: Evaluate the translation quality and gender bias of different foundational LLMs against state-of-the-art Neural Machine Translation (NMT) models using widely used test sets (such as FLoRes-200, WinoMT, Gold BUG, and MuST-SHE).
   - The study finds that foundational LLMs perform worse in terms of gender bias compared to NMT models.
   - Explore the effectiveness of prompt engineering in mitigating gender bias in LLMs, particularly by applying specific prompt structures to instruction-tuned LLMs to reduce gender bias.
3. **Specific Methods**:
   - Conduct experiments using various prompt techniques (such as few-shot prompting, providing contextual information, and chain-of-thought instructions) to find the optimal prompt structure that significantly reduces gender bias.
   - Evaluate the effectiveness of various prompts on the WinoMT dataset and apply the best prompts to other gender bias test sets (Gold BUG, MuST-SHE, etc.) as well as overall machine translation performance evaluation.

Through this research, the paper aims to reveal the shortcomings of foundational LLMs in terms of translation capability and gender bias, and to explore how prompt engineering can effectively alleviate these issues.
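
To make the prompt techniques named above more concrete, the following is a minimal sketch of how few-shot examples, contextual information, and a chain-of-thought instruction might be combined into a single translation prompt. It is an illustration only, not the prompt structure used in the paper: the `Example` class, the `build_prompt` function, the instruction wording, and the sample sentences are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """A hypothetical few-shot demonstration pair."""
    source: str       # English source sentence
    translation: str  # translation in the target language


def build_prompt(
    sentence: str,
    target_language: str,
    shots: Optional[List[Example]] = None,
    context: Optional[str] = None,
    chain_of_thought: bool = False,
) -> str:
    """Assemble a translation prompt from optional components.

    The wording below is illustrative; it is not taken from the paper.
    """
    parts = [f"Translate the following sentence from English into {target_language}."]
    if context:
        # Contextual information, e.g. a note about the referent's gender.
        parts.append(f"Context: {context}")
    if chain_of_thought:
        # A reasoning instruction placed before the demonstrations.
        parts.append(
            "Before translating, identify who each pronoun refers to and "
            "use the grammatical gender that matches that referent."
        )
    for shot in shots or []:
        # Few-shot demonstrations in the same layout as the final query.
        parts.append(f"English: {shot.source}\n{target_language}: {shot.translation}")
    # The sentence to translate, left open for the model to complete.
    parts.append(f"English: {sentence}\n{target_language}:")
    return "\n\n".join(parts)


# Usage with one made-up demonstration and a coreference-style test sentence.
shots = [Example("The doctor finished her shift.", "La doctora terminó su turno.")]
prompt = build_prompt(
    "The developer argued with the designer because she did not like the design.",
    "Spanish",
    shots=shots,
    chain_of_thought=True,
)
print(prompt)
```

Keeping each component optional makes it straightforward to compare prompt variants (for example, with and without the reasoning instruction) on the same test sentences, which is the kind of comparison the summary describes for WinoMT.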