Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a dominant strategy for aligning Language Models (LMs) with human values and goals. The key to the strategy is learning a reward model ($\varphi$) that reflects the latent reward model of humans. While this strategy has proven effective, training $\varphi$ requires a large number of human preference annotations (usually on the order of tens of thousands). Such large-scale annotation is justifiable when it is a one-time effort and the reward model is universally applicable. However, human goals are subjective and depend on the task, so task-specific preference annotations are required, which can be impractical to collect. To address this challenge, we propose a novel approach to infuse domain knowledge into $\varphi$, which reduces the amount of preference annotation required (by $21\times$), avoids the Alignment Tax, and provides some interpretability. We validate our approach on E-Commerce Opinion Summarization, with a significant reduction in dataset size (to just $940$ samples) while advancing the SOTA ($\sim4$ point ROUGE-L improvement, preferred by humans over the SOTA $68\%$ of the time). Our contributions include a novel Reward Modeling technique and two new datasets: PromptOpinSumm (supervised data for Opinion Summarization) and OpinPref (a gold-standard human preference dataset). The proposed methodology opens up avenues for efficient RLHF, making it more adaptable to applications with varying human values. We release the artifacts (Code: http://github.com/efficient-rlhf, PromptOpinSumm: http://hf.co/prompt-opin-summ, OpinPref: http://hf.co/opin-pref) for use under the MIT License.

How to Evaluate Reward Models for RLHF

RewardBench: Evaluating Reward Models for Language Modeling

Prototypical Reward Network for Data-Efficient RLHF

Secrets of RLHF in Large Language Models Part II: Reward Modeling

Optimal Design for Reward Modeling in RLHF

The History and Risks of Reinforcement Learning and Human Feedback

Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards

R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback

Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization

RLHF Workflow: From Reward Modeling to Online RLHF

Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization

RRHF: Rank Responses to Align Language Models with Human Feedback

The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

Towards Understanding the Influence of Reward Margin on Preference Model Performance

Prototypical Reward Network for Data-Efficient Model Alignment

RMB: Comprehensively Benchmarking Reward Models in LLM Alignment

Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment