DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

Tzu-Han Lin, Chen-An Li, Hung-yi Lee, Yun-Nung Chen
2024-10-06
Abstract: Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the **Do**main knowled**ge** merged **R**eward **M**odel (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model by model merging. The experiments demonstrate that DogeRM enhances performance across different benchmarks and provide a detailed analysis showcasing the effects of model merging, showing the great potential of facilitating model alignment.
Computation and Language
What problem does this paper attempt to address?
The paper addresses how to integrate domain-specific knowledge into a general reward model for Reinforcement Learning from Human Feedback (RLHF), improving performance while reducing reliance on large amounts of annotated data. Specifically, it proposes DogeRM, which merges a domain-specific language model with a general reward model so that the reward model improves on domain tasks without requiring a large amount of domain-specific preference data.

### Main Contributions

1. **Proposed DogeRM Framework**: The framework integrates domain-specific knowledge into a general reward model through model merging to improve performance on specific tasks.
2. **Reduced Annotation Data Requirement**: Leveraging existing domain-specific language models reduces the need to collect large amounts of domain-specific preference data, lowering cost and time.
3. **Experimental Validation**: Experiments on multiple benchmarks demonstrate the effectiveness of DogeRM, with performance improvements in mathematics and programming tasks.

### Solution Approach

- **Model Merging**: A general reward model is merged with a domain-specific Supervised Fine-Tuning (SFT) model to produce a new reward model.
- **Weighted Averaging**: Merging is done by taking a weighted average of the two models' parameters. A weight factor λ controls the relative contribution of the general model and the domain-specific model.
- **Experimental Evaluation**: DogeRM is evaluated on multiple benchmarks, including RewardBench and Auto-J Eval, and on datasets such as GSM8K and MBPP.

### Experimental Results

- **Performance Improvement**: DogeRM improves performance in mathematics and programming tasks, especially on RewardBench and Auto-J Eval.
- **Generalization Ability**: DogeRM not only performs well on specific tasks but also generalizes across different model architectures, showing good adaptability.

### Conclusion

DogeRM integrates domain-specific knowledge into a general reward model through model merging, improving performance on specific tasks and reducing reliance on large amounts of annotated data. The method performs well in mathematics and programming tasks and shows potential for broader applicability.
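
The weighted averaging described under Solution Approach can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that λ weights the domain-specific SFT model and (1 − λ) weights the general reward model, that shared parameters are averaged element-wise, and that parameters found only in the reward model (here a hypothetical `reward_head.weight`) are copied unchanged. The function name `merge_parameters` and the toy parameter names are illustrative only, and plain Python lists stand in for tensors.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def merge_parameters(reward_model: Params, sft_model: Params, lam: float) -> Params:
    """Interpolate shared parameters: lam * SFT + (1 - lam) * reward model.

    Assumption: lam weights the domain-specific SFT model. Parameters that
    exist only in the reward model are copied unchanged.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")

    merged: Params = {}
    for name, rm_values in reward_model.items():
        sft_values = sft_model.get(name)
        if sft_values is None:
            # No counterpart in the SFT model: keep the reward model's values.
            merged[name] = list(rm_values)
            continue
        if len(sft_values) != len(rm_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise weighted average of the two models' values.
        merged[name] = [
            lam * s + (1.0 - lam) * r for r, s in zip(rm_values, sft_values)
        ]
    return merged


# Hypothetical toy parameters, for illustration only.
general_rm = {"layer.weight": [1.0, 2.0], "reward_head.weight": [0.5]}
domain_sft = {"layer.weight": [3.0, 4.0]}

merged = merge_parameters(general_rm, domain_sft, lam=0.5)
print(merged)
```

Under these assumptions, λ = 0 returns the general reward model unchanged, and λ = 1 replaces every shared parameter with the SFT model's value while keeping the reward-model-only parameters.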