Abstract: Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning Large Language Models (LLMs) with human preferences. Although aligned generative models have shown remarkable abilities in various tasks, their reliance on high-quality human preference data creates a costly bottleneck in the practical application of RLHF. One primary reason is that current methods collect human feedback on prompt-generation pairs picked uniformly at random from a dataset, which results in suboptimal alignment under a constrained budget and highlights the need for adaptive strategies in efficient alignment. Recent works [Mehta et al., 2023, Muldrew et al., 2024] have tried to address this problem by designing various heuristics based on generation uncertainty. However, the assumptions in [Mehta et al., 2023] are restrictive, and [Muldrew et al., 2024] does not provide any rigorous theoretical guarantee. To address these limitations, we reformulate RLHF within the contextual preference bandit framework, treating prompts as contexts, and develop an active-learning algorithm, $\textit{Active Preference Optimization}$ ($\texttt{APO}$), which enhances model alignment by querying preference data on the most important samples, achieving superior performance under a small sample budget. We analyze the theoretical performance guarantees of $\texttt{APO}$ under the Bradley-Terry-Luce (BTL) preference model, showing that the suboptimality gap of the policy learned via $\texttt{APO}$ scales as $O(1/\sqrt{T})$ for a budget of $T$. We also show that collecting preference data by choosing prompts randomly leads to a policy that suffers a constant suboptimality gap. We perform detailed experimental evaluations on practical preference datasets to validate the efficacy of $\texttt{APO}$ over existing methods, establishing it as a sample-efficient, cost-effective, and scalable solution for alignment.
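
For readers unfamiliar with the BTL preference model mentioned in the abstract, the following is a minimal sketch of how it turns two reward values into a preference probability. It is only an illustration of the standard BTL form, not the $\texttt{APO}$ algorithm itself; the function name `btl_preference_prob` and the scalar reward inputs are assumptions made for this example.

```python
import math


def btl_preference_prob(reward_a: float, reward_b: float) -> float:
    """Probability that generation a is preferred over generation b.

    Under the BTL model this is the logistic sigmoid of the reward
    difference. The scalar rewards are hypothetical inputs; how they are
    parameterized is outside the scope of this sketch.
    """
    diff = reward_a - reward_b
    # Numerically stable sigmoid: avoid overflow in exp for large |diff|.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Equal rewards give an even preference; a higher reward is favored.
    print(btl_preference_prob(1.0, 1.0))
    print(btl_preference_prob(2.0, 0.0))
```

Because the probability depends only on the reward difference, swapping the two generations gives the complementary probability.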

Solving the Inverse Alignment Problem for Efficient RLHF

SAIL: Self-Improving Efficient Online Alignment of Large Language Models

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint

Towards Reliable Alignment: Uncertainty-aware RLHF

Reward Difference Optimization For Sample Reweighting In Offline RLHF

Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models

On The Global Convergence Of Online RLHF With Neural Parametrization

The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment

Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment

Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF

Hybrid Preference Optimization for Alignment: Provably Faster Convergence Rates by Combining Offline Preferences with Online Exploration

Active Preference Optimization for Sample Efficient RLHF

On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

Dual Active Learning for Reinforcement Learning from Human Feedback

Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment

RLHF Workflow: From Reward Modeling to Online RLHF

Stabilizing RLHF through Advantage Model and Selective Rehearsal