Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a dominant strategy for aligning Language Models (LMs) with human values/goals. The key to the strategy is learning a reward model ($\varphi$) that reflects the latent reward model of humans. While this strategy has proven effective, training $\varphi$ requires a large amount of human preference annotation (usually on the order of tens of thousands). Such large-scale annotation is justifiable when it is a one-time effort and the reward model is universally applicable. However, human goals are subjective and depend on the task, so they require task-specific preference annotations, which can be impractical to collect. To address this challenge, we propose a novel approach that infuses domain knowledge into $\varphi$, which reduces the amount of preference annotation required (by $21\times$), avoids Alignment Tax, and provides some interpretability. We validate our approach in E-Commerce Opinion Summarization, with a significant reduction in dataset size (to just $940$ samples) while advancing the SOTA ($\sim4$ point ROUGE-L improvement, preferred by humans over SOTA $68\%$ of the time). Our contributions include a novel Reward Modeling technique and two new datasets: PromptOpinSumm (supervised data for Opinion Summarization) and OpinPref (a gold-standard human preference dataset). The proposed methodology opens up avenues for efficient RLHF, making it more adaptable to applications with varying human values. We release the artifacts (Code: http://github.com/efficient-rlhf; PromptOpinSumm: http://hf.co/prompt-opin-summ; OpinPref: http://hf.co/opin-pref) for use under the MIT License.

Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization
