Abstract: Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. However, such an approach overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result of alignment with single reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. To provide an equitable solution to the problem, we learn a mixture of preference distributions via an expectation-maximization algorithm and propose a MaxMin alignment objective for policy learning inspired by the Egalitarian principle in social choice theory to better represent diverse human preferences. We elucidate the connection of our proposed approach to distributionally robust optimization and general utility RL, thereby highlighting the generality and robustness of our proposed solution. We present comprehensive experimental results on small-scale (GPT-2) and large-scale language models (with Tulu2-7B) and show the efficacy of the proposed approach in the presence of diversity among human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms and improves the win-rate (accuracy) for minority groups by over 33% without compromising the performance of majority groups, showcasing the robustness and fairness of our approach. We remark that our findings in this work are not only limited to language models but also extend to reinforcement learning in general.

What problem does this paper attempt to address?

This paper addresses the problem that Reinforcement Learning from Human Feedback (RLHF) based on a single reward model cannot fully reflect diverse human preferences when aligning language models. Existing RLHF methods usually rely on a single reward model to represent human preferences. This ignores the differences in preference distributions across user groups, so the aligned model is likely to favor the preferences of the majority and neglect those of minorities.

#### Main problem description

1. **Limitations of a single reward model**:
   - Existing RLHF methods use a single reward model to align language models with human preferences, which ignores the diversity of those preferences.
   - A single reward model assumes that all users share the same preferences. The resulting model may be biased towards the majority and neglect minorities, leading to social bias and injustice.
2. **Diverse user preferences**:
   - Different user groups have different preference distributions due to factors such as social background and cultural differences.
   - Ignoring this diversity may cause the preferences of certain user groups to be overlooked or underweighted, especially those of significant minority groups.
3. **Challenges of alignment goals**:
   - How can a method be designed so that the language model aligns with multiple human preferences rather than a single "standard" preference?
   - How can the alignment process account for the preferences of the majority while also treating the preferences of minorities fairly?

#### Solutions proposed in the paper

To solve the above problems, the paper proposes the following innovations:

1. **Impossibility result**:
   - The paper first proves mathematically that alignment with a single reward model cannot fully cover diverse human preferences (Theorem 1). This result highlights the deficiency of a single reward model in representing diverse preferences.
2. **MaxMin-RLHF algorithm**:
   - The paper proposes a new method, MaxMin-RLHF, which captures the preference distributions of different user groups by learning multiple reward functions with the Expectation-Maximization (EM) algorithm.
   - The goal of MaxMin-RLHF is to maximize the minimum social utility, that is, to ensure that the preferences of every user group are treated fairly rather than favoring one specific group.
3. **Experimental verification**:
   - The paper verifies the effectiveness of MaxMin-RLHF through experiments on small-scale (GPT-2) and large-scale (Tulu2-7B) language models. The results show that, compared with a single reward model, MaxMin-RLHF aligns better with diverse human preferences, with a significant improvement for minority groups in particular.

Through these innovations, the paper provides a new perspective and an effective solution to the problem of aligning with diverse human preferences in RLHF.
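To make the "maximize the minimum utility" idea concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the three candidate policies, the two user groups, the utility values, and the group shares are all made-up assumptions. The population-weighted average is used only as a simplified stand-in for single-reward alignment, and the KL regularization used in RLHF is omitted.

```python
import numpy as np

# Hypothetical expected utilities (illustrative values, not from the paper).
# Rows are candidate policies, columns are user groups: (majority, minority).
expected_utility = np.array([
    [0.9, 0.1],  # policy A: serves the majority only
    [0.6, 0.5],  # policy B: balanced across groups
    [0.2, 0.8],  # policy C: serves the minority only
])

# Assumed population shares of the two groups.
group_share = np.array([0.8, 0.2])


def average_choice(utility: np.ndarray, share: np.ndarray) -> int:
    """Index of the policy with the highest population-weighted utility.

    A simplified stand-in for optimizing one aggregate reward.
    """
    return int(np.argmax(utility @ share))


def maxmin_choice(utility: np.ndarray) -> int:
    """Index of the policy whose worst-off group has the highest utility."""
    worst_case = utility.min(axis=1)  # minimum utility over groups, per policy
    return int(np.argmax(worst_case))


avg_idx = average_choice(expected_utility, group_share)
mm_idx = maxmin_choice(expected_utility)

print("average-utility choice:", "ABC"[avg_idx])
print("max-min choice:", "ABC"[mm_idx])
```

With these assumed numbers, the weighted average favors policy A, which leaves the minority group with a utility of 0.1, while the max-min rule picks policy B, whose worst-off group still receives 0.5. In the method summarized above, the per-group utilities would come from the reward functions learned with EM rather than from a fixed table.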

MaxMin-RLHF: Alignment with Diverse Human Preferences
