Abstract: Bias in LLMs can harm user experience and societal outcomes. However, current bias mitigation methods often require intensive human feedback, lack transferability to other topics, or yield overconfident and random outputs. We find that involving LLMs in role-playing scenarios boosts their ability to recognize and mitigate biases. Based on this, we propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach for bias mitigation that replaces human feedback in traditional RLHF. We utilize LLMs in multi-role debates to create a dataset that includes both high-bias and low-bias instances for training the reward model in reinforcement learning. Our approach comprises two modes: (1) self-reflection, where the same LLM participates in multi-role debates, and (2) teacher-student, where a more advanced LLM like GPT-3.5-turbo guides the LLM to perform this task. Experimental results across different LLMs demonstrate the effectiveness of our approach in bias mitigation.

What problem does this paper attempt to address?

The problem this paper addresses is bias in large language models (LLMs). Bias can affect user experience and societal outcomes, so reducing it is crucial to keeping LLM applications fair and impartial.

### Main Problems

1. **Heavy reliance on human feedback**: Traditional reinforcement learning from human feedback (RLHF) requires a large amount of human intervention, which is both time-consuming and costly.
2. **Low transferability**: Directly prompting an LLM can reduce bias within a specific conversation, but the effect does not carry over to other topics, so prompts must be redesigned for each new conversation.
3. **Overconfidence and randomness in output**: LLMs can detect and correct bias through self-reflection, but without external feedback they often become overconfident or produce random outputs, which makes the reflection ineffective.

### Solutions

To address these problems, the authors propose **Reinforcement Learning from Multi-role Debates as Feedback (RLDF)**. Its main features are:

- **Dataset construction**: LLMs participate in multi-role debates, which produce a dataset of high-bias and low-bias instances used to train the reward model in reinforcement learning.
- **Two modes**:
  - **Self-reflection mode**: The same LLM participates in the multi-role debates, generating and critiquing its own content.
  - **Teacher-student mode**: A more advanced LLM (such as GPT-3.5-turbo) guides the original LLM toward more logical and less biased content.
- **Reinforcement learning**: The Proximal Policy Optimization (PPO) algorithm iteratively updates the LLM's parameters so that the model gradually produces less biased outputs.
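
The dataset-construction step and the two modes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `run_debate`, `make_pairs`, `pairwise_loss`, and `toy_llm`, the role names, and the prompts are all hypothetical, and the summary does not say which loss the reward model uses, so a pairwise logistic (Bradley-Terry style) loss common in RLHF is assumed. Passing the same model as debater and referee corresponds to self-reflection mode; passing a stronger model as referee would correspond to teacher-student mode.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class DebateInstance:
    topic: str
    text: str
    label: str  # "high_bias" or "low_bias"


def run_debate(topic: str, debater: LLM, referee: LLM,
               roles: List[str]) -> List[DebateInstance]:
    """Each role states a view on the topic; the referee labels its bias level."""
    instances = []
    for role in roles:
        statement = debater(f"As a {role}, give your view on: {topic}")
        verdict = referee(
            f"Is this statement biased? Answer 'high' or 'low'.\n{statement}"
        )
        is_high = verdict.strip().lower().startswith("high")
        instances.append(
            DebateInstance(topic, statement, "high_bias" if is_high else "low_bias")
        )
    return instances


def make_pairs(instances: List[DebateInstance]) -> List[Tuple[str, str]]:
    """Pair each low-bias text (preferred) with each high-bias text (rejected)."""
    low = [i.text for i in instances if i.label == "low_bias"]
    high = [i.text for i in instances if i.label == "high_bias"]
    return [(l, h) for l in low for h in high]


def pairwise_loss(score_low: float, score_high: float) -> float:
    """Assumed reward-model loss: -log(sigmoid(score_low - score_high))."""
    return math.log1p(math.exp(-(score_low - score_high)))


def toy_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns canned text so the sketch runs offline."""
    if prompt.startswith("Is this statement biased?"):
        return "high" if "always" in prompt else "low"
    if "skeptic" in prompt:
        return "Older workers always resist change."
    return "Adaptability varies by person, not age."


if __name__ == "__main__":
    # Self-reflection mode: the same model debates and referees.
    data = run_debate("age and adaptability at work", toy_llm, toy_llm,
                      ["skeptic", "mediator"])
    for preferred, rejected in make_pairs(data):
        print("preferred:", preferred, "| rejected:", rejected)
```

In a full pipeline, the (preferred, rejected) pairs would train a reward model, and its scores would then serve as the reward signal for PPO updates of the policy LLM; both stages are omitted here.
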
### Experimental Results

The experimental results show that RLDF effectively reduces bias across multiple LLMs and bias types, outperforming existing related methods. Its performance is especially strong on five bias types: age, nationality, institution, beauty, and occupation.

### Summary

By introducing multi-role debates, RLDF reduces the dependence on human feedback, improves bias mitigation, and offers good transferability and stability. The method reduces bias in LLMs while maintaining or improving the overall quality and coherence of responses.

Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs

A Multi-LLM Debiasing Framework

Cognitive Bias in Decision-Making with LLMs

Benchmarking Bias in Large Language Models during Role-Playing

REFINE-LM: Mitigating Language Model Stereotypes via Reinforcement Learning

Uncovering Biases with Reflective Large Language Models

Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework

Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings

Breaking Bias, Building Bridges: Evaluation and Mitigation of Social Biases in LLMs via Contact Hypothesis

Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

Social Debiasing for Fair Multi-modal LLMs

Investigating Bias Representations in Llama 2 Chat via Activation Steering

Editable Fairness: Fine-Grained Bias Mitigation in Language Models

The Perfect Blend: Redefining RLHF with Mixture of Judges

Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models

Mitigate Bias in Face Recognition using Skewness-Aware Reinforcement Learning

Argumentative Experience: Reducing Confirmation Bias on Controversial Issues through LLM-Generated Multi-Persona Debates

Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions

Systematic Biases in LLM Simulations of Debates

LIDAO: Towards Limited Interventions for Debiasing (Large) Language Models

Provable Multi-Party Reinforcement Learning with Diverse Human Feedback