Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a dominant strategy for aligning Language Models (LMs) with human values and goals. The key to the strategy is learning a reward model ($\varphi$) that reflects the latent reward model of humans. While this strategy has proven effective, training $\varphi$ requires a large amount of human preference annotation (usually on the order of tens of thousands of samples). Such large-scale annotation is justifiable when it is a one-time effort and the reward model is universally applicable. However, human goals are subjective and depend on the task, so they require task-specific preference annotations, which can be impractical to collect. To address this challenge, we propose a novel approach that infuses domain knowledge into $\varphi$, which reduces the amount of preference annotation required (by $21\times$), avoids the Alignment Tax, and provides some interpretability. We validate our approach on E-Commerce Opinion Summarization, reducing the dataset size to just $940$ samples while advancing the SOTA ($\sim4$-point ROUGE-L improvement; preferred by humans over the SOTA $68\%$ of the time). Our contributions include a novel Reward Modeling technique and two new datasets: PromptOpinSumm (supervised data for Opinion Summarization) and OpinPref (a gold-standard human preference dataset). The proposed methodology opens up avenues for efficient RLHF, making it more adaptable to applications with varying human values. We release the artifacts (Code: http://github.com/efficient-rlhf. PromptOpinSumm: http://hf.co/prompt-opin-summ. OpinPref: http://hf.co/opin-pref) for use under the MIT License.

Measuring memorization in RLHF for code completion

Understanding and Alleviating Memory Consumption in RLHF for LLMs

Stabilizing RLHF through Advantage Model and Selective Rehearsal

RRHF: Rank Responses to Align Language Models with Human Feedback

Investigating on RLHF methodology

Teaching Large Language Models to Reason with Reinforcement Learning

Solving the Inverse Alignment Problem for Efficient RLHF

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint

Disentangling Length from Quality in Direct Preference Optimization

Active Preference Optimization for Sample Efficient RLHF

Mitigating the Alignment Tax of RLHF

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads

On Memorization of Large Language Models in Logical Reasoning

Confronting Reward Model Overoptimization with Constrained RLHF

Active Preference Learning for Large Language Models

Parameter Efficient Reinforcement Learning from Human Feedback