Abstract: Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the **Do**main knowled**ge** merged **R**eward **M**odel (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model by model merging. Our experiments demonstrate that DogeRM enhances performance across different benchmarks, and we provide a detailed analysis of the effects of model merging, showing its great potential for facilitating model alignment.

What problem does this paper attempt to address?

The paper addresses how to integrate domain-specific knowledge into a general reward model for Reinforcement Learning from Human Feedback (RLHF) without collecting large amounts of domain-specific preference data. It proposes DogeRM, a method that merges a domain-specific language model with a general reward model, so that the resulting reward model performs better on tasks in that domain.

### Main Contributions

1. **Proposed DogeRM Framework**: The framework integrates domain-specific knowledge into a general reward model through model merging to improve the reward model's performance on specific tasks.
2. **Reduced Annotation Data Requirement**: By reusing existing domain-specific language models, the method reduces the need to collect large amounts of domain-specific preference data, lowering cost and time.
3. **Experimental Validation**: Experiments on multiple benchmarks demonstrate the effectiveness of DogeRM, with significant performance improvements on mathematics and programming tasks.

### Solution Approach

- **Model Merging**: A general reward model is merged with a domain-specific Supervised Fine-Tuning (SFT) model to produce a new reward model.
- **Weighted Averaging**: The merge is a weighted average of the parameters of the two models. A weight factor λ controls the relative contribution of the general reward model and the domain-specific model.
- **Experimental Evaluation**: DogeRM's performance is evaluated on multiple benchmarks, including RewardBench and Auto-J Eval, and on datasets such as GSM8K and MBPP.

### Experimental Results

- **Performance Improvement**: DogeRM significantly improves performance on mathematics and programming tasks, especially on RewardBench and Auto-J Eval.
- **Generalization Ability**: DogeRM performs well on specific tasks and also generalizes across different model architectures.

### Conclusion

DogeRM integrates domain-specific knowledge into a general reward model through model merging, improving the reward model's performance on specific tasks while reducing the reliance on large amounts of annotated data. The method performs particularly well on mathematics and programming tasks and appears broadly applicable.
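The weighted averaging described under Solution Approach can be illustrated with a minimal sketch. It makes several assumptions that the summary does not spell out: λ weights the domain-specific SFT model and (1 − λ) weights the general reward model, only parameters present in both models are interpolated, and parameters that exist only in the reward model (for example a scalar reward head) are copied unchanged. The function name `merge_parameters` and the toy parameter names are hypothetical, and plain Python lists stand in for real weight tensors.

```python
from typing import Dict, List

# Toy stand-in for a model state: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_parameters(reward_params: Params, domain_params: Params, lam: float) -> Params:
    """Interpolate parameters shared by both models; keep reward-only ones.

    Assumption: merged = lam * domain_sft + (1 - lam) * reward_model.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")

    merged: Params = {}
    for name, rm_values in reward_params.items():
        if name in domain_params:
            sft_values = domain_params[name]
            if len(sft_values) != len(rm_values):
                raise ValueError(f"size mismatch for parameter {name!r}")
            # Element-wise weighted average of the two models' weights.
            merged[name] = [
                lam * s + (1.0 - lam) * r for s, r in zip(sft_values, rm_values)
            ]
        else:
            # Assumption: reward-only parameters (e.g. the reward head)
            # are copied from the reward model unchanged.
            merged[name] = list(rm_values)
    # Parameters found only in the SFT model (e.g. its LM head) are dropped.
    return merged


# Hypothetical toy models with one shared layer each.
reward_model = {"layer.weight": [1.0, 2.0], "reward_head.weight": [0.5]}
domain_sft = {"layer.weight": [3.0, 4.0], "lm_head.weight": [9.0]}

merged = merge_parameters(reward_model, domain_sft, lam=0.25)
print(merged)
```

With λ = 0 the sketch returns the general reward model unchanged, and increasing λ moves the shared layers toward the domain-specific model, which is the trade-off the weight factor is meant to control.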

DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

DMoERM: Recipes of Mixture-of-Experts for Effective Reward Modeling

Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts

Prototypical Reward Network for Data-Efficient RLHF

RRM: Robust Reward Model Training Mitigates Reward Hacking

Secrets of RLHF in Large Language Models Part II: Reward Modeling

HAF-RM: A Hybrid Alignment Framework for Reward Model Training

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

Reward-Robust RLHF in LLMs

Self-Evolved Reward Learning for LLMs

Towards Understanding the Influence of Reward Margin on Preference Model Performance

Semi-Supervised Reward Modeling via Iterative Self-Training

Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment

Reward Modeling Requires Automatic Adjustment Based on Data Quality

LongReward: Improving Long-context Large Language Models with AI Feedback

ALaRM: Align Language Models via Hierarchical Rewards Modeling

REvolve: Reward Evolution with Large Language Models using Human Feedback

WARM: On the Benefits of Weight Averaged Reward Models

Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both

How to Evaluate Reward Models for RLHF