The power of Prompts: Evaluating and Mitigating Gender Bias in MT with LLMs

Aleix Sant, Carlos Escolano, Audrey Mash, Francesca De Luca Fornaciari, Maite Melero

2024-07-26

Abstract: This paper studies gender bias in machine translation through the lens of Large Language Models (LLMs). Four widely used test sets are employed to benchmark various base LLMs, comparing their translation quality and gender bias against state-of-the-art Neural Machine Translation (NMT) models for the English to Catalan (En $\rightarrow$ Ca) and English to Spanish (En $\rightarrow$ Es) translation directions. Our findings reveal pervasive gender bias across all models, with base LLMs exhibiting a higher degree of bias than NMT models. To combat this bias, we explore prompt engineering techniques applied to an instruction-tuned LLM. We identify a prompt structure that significantly reduces gender bias, by up to 12% on the WinoMT evaluation dataset, compared to more straightforward prompts. These results significantly reduce the gender bias accuracy gap between LLMs and traditional NMT systems.

Computation and Language

What problem does this paper attempt to address?

### The Problem the Paper Aims to Solve

This paper aims to address the issue of gender bias in machine translation and attempts to mitigate this bias through prompt engineering. Specifically:

1. **Research Background**:
   - Gender bias is prevalent in machine translation systems, potentially leading to unfair representation or resource allocation for certain groups.
   - Large Language Models (LLMs), despite their excellent performance in natural language processing tasks, have been less studied for gender bias in the field of machine translation.
2. **Main Objectives**:
   - Benchmark Comparison: Evaluate the translation quality and gender bias of different foundational LLMs against state-of-the-art Neural Machine Translation (NMT) models using widely used test sets (such as FLoRes-200, WinoMT, Gold BUG, and MuST-SHE).
   - The study finds that foundational LLMs perform worse in terms of gender bias compared to NMT models.
   - Explore the effectiveness of prompt engineering in mitigating gender bias in LLMs, particularly by applying specific prompt structures to instruction-tuned LLMs to reduce gender bias.
3. **Specific Methods**:
   - Conduct experiments using various prompt techniques (such as few-shot prompting, providing contextual information, and chain-of-thought instructions) to find the optimal prompt structure that significantly reduces gender bias.
   - Evaluate the effectiveness of various prompts on the WinoMT dataset and apply the best prompts to other gender bias test sets (Gold BUG, MuST-SHE, etc.) as well as overall machine translation performance evaluation.

Through this research, the paper aims to reveal the shortcomings of foundational LLMs in terms of translation capability and gender bias, and to explore how prompt engineering can effectively alleviate these issues.
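The prompt techniques listed under Specific Methods can be pictured with a minimal Python sketch. It is not the authors' code or their actual prompt wording: `PromptConfig`, `build_prompt`, `gender_accuracy`, the instruction strings, and the example sentences are all hypothetical, chosen only to show how few-shot examples, contextual information, and a chain-of-thought instruction might be combined into one prompt, and how a WinoMT-style gender accuracy could be computed from predicted and gold gender labels.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PromptConfig:
    """Toggles for the prompt ingredients (all names here are hypothetical)."""
    target_language: str = "Spanish"
    few_shot: list[tuple[str, str]] = field(default_factory=list)  # (source, translation)
    context: str = ""               # optional background information for the model
    chain_of_thought: bool = False  # ask the model to resolve pronouns first


def build_prompt(source: str, cfg: PromptConfig) -> str:
    """Assemble a translation prompt from the enabled ingredients."""
    parts = []
    if cfg.context:
        parts.append(cfg.context)
    parts.append(
        f"Translate the following sentence from English to {cfg.target_language}."
    )
    if cfg.chain_of_thought:
        # Illustrative wording only; the paper's instruction text is not given here.
        parts.append(
            "First identify who each pronoun refers to, then translate "
            "using the matching grammatical gender."
        )
    for english, translation in cfg.few_shot:
        parts.append(f"English: {english}\n{cfg.target_language}: {translation}")
    # The sentence to translate goes last, with the target side left open.
    parts.append(f"English: {source}\n{cfg.target_language}:")
    return "\n\n".join(parts)


def gender_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of sentences whose translated entity carries the gold gender."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


if __name__ == "__main__":
    cfg = PromptConfig(
        target_language="Spanish",
        few_shot=[(
            "The doctor asked the nurse to help her in the procedure.",
            "La doctora le pidió a la enfermera que la ayudara en el procedimiento.",
        )],
        context="Occupations can be held by people of any gender.",
        chain_of_thought=True,
    )
    print(build_prompt(
        "The developer argued with the designer because she did not like the design.",
        cfg,
    ))
```

Switching the three options on and off yields the prompt variants one would compare; scoring each variant's output with `gender_accuracy` on the same sentences gives a like-for-like comparison, assuming some separate step extracts the gender of the translated entity.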
