Abstract: Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. The RLHF process typically starts by training a reward model (RM) on human preference data. Conventional RMs are trained on pairs of responses to the same user request, with relative ratings indicating which response humans prefer. The trained RM then serves as a proxy for human preferences. However, because RMs are black boxes, their outputs lack interpretability: humans cannot intuitively understand why an RM judges a response to be good or bad. Since RMs act as proxies for human preferences, we believe they should be human-interpretable, both to ensure that their internal decision processes are consistent with human preferences and to prevent reward hacking in LLM alignment. To build RMs with interpretable preferences, we propose a two-stage approach: i) train an Absolute-Rating Multi-Objective Reward Model (ArmoRM) on multi-dimensional absolute-rating data, with each dimension corresponding to a human-interpretable objective (e.g., honesty, verbosity, safety); ii) employ a Mixture-of-Experts (MoE) strategy with a gating network that automatically selects the most suitable reward objectives based on the context. We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM. Our trained model, ArmoRM-Llama3-8B, obtains state-of-the-art performance on RewardBench, a benchmark evaluating RMs for language modeling. Notably, our model surpasses the LLM-as-a-judge method with GPT-4 judges by a margin, and approaches the performance of the much larger Nemotron-4 340B reward model.
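
The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that a shallow MLP gating network selects reward objectives based on the context. The softmax over gate outputs, the weighted-sum aggregation, the feature dimensions, the random initialization, and every name below (`GatingMLP`, `scalar_reward`, `prompt_features`) are assumptions made for illustration.

```python
import math
import random

# Objective names are the examples given in the abstract; the real model
# may use a different and larger set.
OBJECTIVES = ["honesty", "verbosity", "safety"]


def softmax(values):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class GatingMLP:
    """Hypothetical shallow gating network (one hidden layer, ReLU).

    ASSUMPTION: the softmax output and the layer sizes are illustrative
    choices, not details taken from the abstract.
    """

    def __init__(self, in_dim, hidden_dim, n_objectives, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
                   for _ in range(n_objectives)]
        self.b2 = [0.0] * n_objectives

    def __call__(self, prompt_features):
        hidden = [max(0.0, h)
                  for h in linear(self.w1, self.b1, prompt_features)]
        return softmax(linear(self.w2, self.b2, hidden))


def scalar_reward(objective_ratings, gate_weights):
    """ASSUMPTION: combine per-objective ratings as a gate-weighted sum."""
    return sum(w * r for w, r in zip(gate_weights, objective_ratings))


# Made-up inputs: a 4-dimensional context feature vector and one absolute
# rating per objective, as a stage-one model might produce.
gate = GatingMLP(in_dim=4, hidden_dim=8, n_objectives=len(OBJECTIVES))
prompt_features = [0.2, -0.1, 0.5, 0.3]
ratings = [0.9, 0.4, 0.8]

weights = gate(prompt_features)
reward = scalar_reward(ratings, weights)

for name, weight, rating in zip(OBJECTIVES, weights, ratings):
    print(f"{name}: weight={weight:.3f}, rating={rating:.2f}")
print(f"scalar reward: {reward:.3f}")
```

Under these assumptions the per-objective weights can be read directly, which is the kind of interpretability the abstract argues for: one can see which objectives contributed to the final score for a given context.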

GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment

ALaRM: Align Language Models via Hierarchical Rewards Modeling

Fine-tuning Language Models with Generative Adversarial Reward Modelling

Generative Reward Models

Aligning Large Language Models with Representation Editing: A Control Perspective

AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

Fine-tuning Language Models with Generative Adversarial Feedback

ARGS: Alignment as Reward-Guided Search

RAIN: Your Language Models Can Align Themselves Without Finetuning

Reward-Robust RLHF in LLMs

RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment

The Real, the Better: Aligning Large Language Models with Online Human Behaviors

Progressively Label Enhancement for Large Language Model Alignment

Prior Constraints-based Reward Model Training for Aligning Large Language Models

Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts

MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time

RMB: Comprehensively Benchmarking Reward Models in LLM Alignment

HAF-RM: A Hybrid Alignment Framework for Reward Model Training

On Diversified Preferences of Large Language Model Alignment

Ruler: A Model-Agnostic Method to Control Generated Length for Large Language Models

Offline Regularised Reinforcement Learning for Large Language Models Alignment