On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

Jiancong Xiao, Ziniu Li, Xingyu Xie, Emily Getzen, Cong Fang, Qi Long, Weijie J. Su

2024-05-26

Abstract: Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that reinforcement learning from human feedback (RLHF), the predominant approach for aligning LLMs with human preferences through a reward model, suffers from an inherent algorithmic bias due to its Kullback-Leibler-based regularization in optimization. In extreme cases, this bias could lead to a phenomenon we term preference collapse, where minority preferences are virtually disregarded. To mitigate this algorithmic bias, we introduce preference matching (PM) RLHF, a novel approach that provably aligns LLMs with the preference distribution of the reward model under the Bradley-Terry-Luce/Plackett-Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses, which helps the LLM balance response diversification and reward maximization. Notably, we obtain this regularizer by solving an ordinary differential equation that is necessary for the PM property. For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation. Finally, we empirically validate the effectiveness of conditional PM RLHF through experiments on the OPT-1.3B and Llama-2-7B models, demonstrating a 29% to 41% improvement in alignment with human preferences, as measured by a certain metric, compared to standard RLHF.

Machine Learning, Methodology

What problem does this paper attempt to address?

This paper asks how large language models (LLMs) can be aligned accurately with human preferences. The authors argue that standard reinforcement learning from human feedback (RLHF) carries an inherent algorithmic bias that, in extreme cases, causes minority preferences to be virtually disregarded, a phenomenon they call "preference collapse". The bias originates in the KL-divergence regularization: the pre-trained LLM serves as the reference model, so any unaligned bias in that reference may carry over to the aligned model. To address this, they propose preference matching (PM) RLHF, which provably aligns the LLM with the preference distribution of the reward model under the Bradley-Terry-Luce/Plackett-Luce model. The PM regularizer is the negative logarithm of the LLM's policy probability over responses, and it balances response diversification against reward maximization. The authors obtain it by solving an ordinary differential equation that is necessary for the PM property. They also extend the method to more general preference models and introduce a conditional variant of PM RLHF tailored to natural language generation. Experiments on the OPT-1.3B and Llama-2-7B models show that conditional PM RLHF improves alignment with human preferences by 29% to 41%, as measured by a certain metric, compared with standard RLHF. In short, the paper targets algorithmic bias in RLHF so that aligned LLMs reflect diverse human preferences, which matters for fair and economically sound decision-making.
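
The contrast described above can be illustrated with a small numerical sketch. It is not the authors' implementation. It relies on two standard closed forms: the maximizer of E[r] - beta * KL(pi || pi_ref) is pi proportional to pi_ref * exp(r / beta), and the maximizer of E[r - log pi] is softmax(r), which is the preference distribution implied by the reward under the Bradley-Terry-Luce model. The two-response setup, the 70/30 preference split, the 90/10 reference policy, beta = 0.1, and all function names are assumptions chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_rlhf_optimum(rewards, ref_probs, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || pi_ref).

    The optimum is pi(y) proportional to pi_ref(y) * exp(r(y) / beta).
    """
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    return softmax(logits)


def pm_optimum(rewards):
    """Closed-form maximizer of E_pi[r - log pi], i.e. reward plus entropy.

    The optimum is softmax(r), the reward model's preference distribution.
    """
    return softmax(rewards)


def pm_objective(policy, rewards):
    """Value of E_pi[r(y) - log pi(y)] for a given policy."""
    return sum(p * (r - math.log(p)) for p, r in zip(policy, rewards) if p > 0)


# Hypothetical setup: two responses, reward model prefers the first 70% of the time.
rewards = [math.log(0.7), math.log(0.3)]
ref_probs = [0.9, 0.1]  # assumed reference (pre-trained) policy, already skewed

target = softmax(rewards)  # preference distribution of the reward model
kl_policy = kl_rlhf_optimum(rewards, ref_probs, beta=0.1)
pm_policy = pm_optimum(rewards)

print("target   :", target)
print("KL RLHF  :", kl_policy)
print("PM RLHF  :", pm_policy)
```

In this toy setting the KL-regularized optimum places almost all of its mass on the majority response, while the PM optimum reproduces the 70/30 split. With beta = 1 and a uniform reference policy the two optima coincide, so the gap here comes from the assumed small beta and skewed reference.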

Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

Aligning Large Language Models with Human Preferences through Representation Engineering

Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback

The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback

SAIL: Self-Improving Efficient Online Alignment of Large Language Models

Fine-Tuning Language Models with Advantage-Induced Policy Alignment

Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards

On Diversified Preferences of Large Language Model Alignment

Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models

Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint

Towards Reliable Alignment: Uncertainty-aware RLHF

A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging

Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game

ALaRM: Align Language Models via Hierarchical Rewards Modeling

Aligning Large Language Models via Fine-grained Supervision

Taming Overconfidence in LLMs: Reward Calibration in RLHF

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences