Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs

Ruoxi Cheng, Haoxuan Ma, Shuirong Cao, Jiaqi Li, Aihua Pei, Zhiqiang Wang, Pengliang Ji, Haoyu Wang, Jiaqi Huo
2024-06-19
Abstract: Bias in LLMs can harm user experience and societal outcomes. However, current bias mitigation methods often require intensive human feedback, lack transferability to other topics, or yield overconfident and random outputs. We find that involving LLMs in role-playing scenarios boosts their ability to recognize and mitigate biases. Based on this, we propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach for bias mitigation replacing human feedback in traditional RLHF. We utilize LLMs in multi-role debates to create a dataset that includes both high-bias and low-bias instances for training the reward model in reinforcement learning. Our approach comprises two modes: (1) self-reflection, where the same LLM participates in multi-role debates, and (2) teacher-student, where a more advanced LLM like GPT-3.5-turbo guides the LLM to perform this task. Experimental results across different LLMs demonstrate the effectiveness of our approach in bias mitigation.
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses bias in large language models (LLMs). Bias can harm user experience and societal outcomes, so reducing it is crucial to keeping LLM applications fair and impartial.

### Main Problems

1. **Heavy reliance on human feedback**: Reinforcement learning from human feedback (RLHF) requires extensive human intervention, which is both time-consuming and costly.
2. **Low transferability**: Directly prompting an LLM can reduce bias within a specific conversation, but the effect transfers poorly to other topics, so prompts must be redesigned for each new conversation.
3. **Overconfident and random outputs**: LLMs can detect and correct bias through self-reflection, but without external feedback they tend to be overconfident or random, which weakens the reflection.

### Solutions

To address these problems, the authors propose **Reinforcement Learning from Multi-role Debates as Feedback (RLDF)**. Its main features are:

- **Dataset construction**: LLMs take part in multi-role debates, producing a dataset of high-bias and low-bias instances that is used to train the reward model for reinforcement learning.
- **Two modes**:
  - **Self-reflection mode**: The same LLM participates in the multi-role debates, generating and critiquing its own content.
  - **Teacher-student mode**: A more advanced LLM (such as GPT-3.5-turbo) guides the original LLM toward more logical and less biased content.
- **Reinforcement learning**: The Proximal Policy Optimization (PPO) algorithm iteratively updates the LLM's parameters so that the model gradually produces less biased outputs.
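
The dataset-construction step above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each debate turn already carries a numeric bias score assigned by a referee LLM, and it assumes the reward model is trained with a Bradley-Terry style pairwise loss, as is common in RLHF; the summary does not confirm either detail. The names `DebateTurn`, `build_preference_pairs`, `pairwise_loss`, and the `margin` parameter are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class DebateTurn:
    role: str          # hypothetical role label played by the LLM
    text: str          # what that role said during the debate
    bias_score: float  # assumed referee rating in [0, 1]; higher = more biased


def build_preference_pairs(turns, margin=0.2):
    """Pair every turn with each turn that is more biased by at least `margin`.

    Returns (preferred, rejected) text pairs, where the preferred text is the
    low-bias instance and the rejected text is the high-bias instance.
    """
    pairs = []
    for low in turns:
        for high in turns:
            if high.bias_score - low.bias_score >= margin:
                pairs.append((low.text, high.text))
    return pairs


def pairwise_loss(reward_low_bias, reward_high_bias):
    """Bradley-Terry style loss: -log(sigmoid(r_low - r_high)).

    The loss shrinks as the reward model scores the low-bias text higher
    than the high-bias text. (Assumed objective, not taken from the paper.)
    """
    diff = reward_low_bias - reward_high_bias
    return math.log1p(math.exp(-diff))


# Toy debate with made-up placeholder turns and scores.
turns = [
    DebateTurn("role_a", "statement A", 0.1),
    DebateTurn("role_b", "statement B", 0.5),
    DebateTurn("role_c", "statement C", 0.9),
]
pairs = build_preference_pairs(turns)
```

In a full pipeline, a reward model trained on such pairs would then supply the scalar reward for the PPO updates described in the last bullet. In the teacher-student mode, the scores would come from the stronger model rather than from the LLM being trained.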
### Experimental Results

The experiments show that RLDF reduces bias across multiple LLMs and bias types and outperforms existing related methods. Its gains are especially notable on five bias types: age, nationality, institution, beauty, and occupation.

### Summary

By introducing multi-role debates, RLDF reduces dependence on human feedback, improves bias mitigation, and offers good transferability and stability. It reduces bias in LLMs while maintaining or improving the overall quality and coherence of responses.