Abstract: Traditionally, reward models used for reinforcement learning from human feedback (RLHF) are trained to directly predict preference scores without leveraging the generation capabilities of the underlying large language model (LLM). This limits the capabilities of reward models as they must reason implicitly about the quality of a response, i.e., preference modeling must be performed in a single forward pass through the model. To enable reward models to reason explicitly about the quality of a response, we introduce Critique-out-Loud (CLoud) reward models. CLoud reward models operate by first generating a natural language critique of the assistant's response that is then used to predict a scalar reward for the quality of the response. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models: compared to classic reward models, CLoud reward models improve pairwise preference classification accuracy on RewardBench by 4.65 and 5.84 percentage points for the 8B and 70B base models, respectively. Furthermore, CLoud reward models lead to a Pareto improvement for win rate on ArenaHard when used as the scoring model for Best-of-N. Finally, we explore how to exploit the dynamic inference compute capabilities of CLoud reward models by performing self-consistency decoding for reward prediction.
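The two evaluation settings named in the abstract, pairwise preference classification and Best-of-N selection, can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: `toy_reward` is a hypothetical word-overlap scorer standing in for a trained reward model, and the function names are not taken from the paper or from any benchmark's code.

```python
from typing import Callable, Sequence, Tuple

# A reward function maps (prompt, response) to a scalar score.
RewardFn = Callable[[str, str], float]


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: counts prompt words reused in the response."""
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().split())
    return float(len(prompt_words & response_words))


def pairwise_accuracy(
    reward_fn: RewardFn, pairs: Sequence[Tuple[str, str, str]]
) -> float:
    """Fraction of (prompt, chosen, rejected) triples where chosen scores strictly higher."""
    correct = sum(
        reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
        for prompt, chosen, rejected in pairs
    )
    return correct / len(pairs)


def best_of_n(reward_fn: RewardFn, prompt: str, candidates: Sequence[str]) -> str:
    """Best-of-N: return the candidate response with the highest reward."""
    return max(candidates, key=lambda response: reward_fn(prompt, response))


if __name__ == "__main__":
    pairs = [
        ("name a red fruit", "a red fruit is the apple", "bananas"),
        ("capital of france", "paris", "the capital of france is paris"),
    ]
    print(pairwise_accuracy(toy_reward, pairs))
    print(best_of_n(toy_reward, "name a red fruit",
                    ["bananas", "red", "a red fruit is the apple"]))
```

The toy scorer gets the second pair wrong because it rewards verbosity rather than correctness, which is the kind of failure a stronger reward model is meant to reduce.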

What problem does this paper attempt to address?

The problem this paper addresses is that traditional reward models used for Reinforcement Learning from Human Feedback (RLHF) directly predict preference scores without leveraging the generation capabilities of the underlying large language model (LLM). This limits reward models because they must reason implicitly about the quality of a response; that is, preference modeling must be completed in a single forward pass. To let reward models reason explicitly about response quality, the authors introduce a new kind of reward model, the Critique-out-Loud (CLoud) reward model. A CLoud reward model first generates a natural language critique of the assistant's response and then predicts a scalar reward based on that critique. This lets the reward model reason in more detail when evaluating response quality, thereby improving its accuracy and interpretability.

The main contributions of the paper include:
- Proposing the CLoud reward model, which generates a critique of a response before scoring it.
- Verifying experimentally that CLoud reward models outperform classic reward models on pairwise preference classification (RewardBench) and as the scoring model for Best-of-N (win rate on ArenaHard).
- Exploring how to use the dynamic inference compute capabilities of CLoud reward models to improve reward prediction through self-consistency decoding.

By introducing language generation capabilities, the CLoud reward model lays the foundation for unifying the classic reward model and the LLM-as-a-Judge framework, inheriting the advantages of both.
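The critique-then-score flow described above, together with self-consistency decoding, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_generate_critique` and `toy_reward_head` are hypothetical stand-ins for the LLM's critique sampling and the reward head, and averaging the rewards over several sampled critiques is assumed here as the aggregation rule.

```python
import random
from statistics import mean
from typing import Callable

# Hypothetical interfaces: a critique sampler and a reward head conditioned on the critique.
CritiqueFn = Callable[[str, str, random.Random], str]
RewardFn = Callable[[str, str, str], float]


def toy_generate_critique(prompt: str, response: str, rng: random.Random) -> str:
    """Toy stand-in for sampling a critique from the language model."""
    if "4" in response:
        options = ["The answer is correct.", "The answer is correct but terse."]
    else:
        options = ["The answer is wrong.", "The answer is wrong and unexplained."]
    return rng.choice(options)


def toy_reward_head(prompt: str, response: str, critique: str) -> float:
    """Toy stand-in for the reward head: scores the response given its critique."""
    score = 1.0 if "correct" in critique else -1.0
    if "terse" in critique or "unexplained" in critique:
        score -= 0.25
    return score


def cloud_reward(
    prompt: str,
    response: str,
    generate_critique: CritiqueFn,
    reward_head: RewardFn,
    num_critiques: int = 1,
    seed: int = 0,
) -> float:
    """Generate critiques first, then predict a scalar reward from each one.

    With num_critiques > 1, the rewards are averaged (assumed aggregation rule),
    which spends more inference compute on a single reward estimate.
    """
    rng = random.Random(seed)
    rewards = []
    for _ in range(num_critiques):
        critique = generate_critique(prompt, response, rng)
        rewards.append(reward_head(prompt, response, critique))
    return mean(rewards)


if __name__ == "__main__":
    question = "What is 2 + 2?"
    good = cloud_reward(question, "4", toy_generate_critique, toy_reward_head, num_critiques=8)
    bad = cloud_reward(question, "5", toy_generate_critique, toy_reward_head, num_critiques=8)
    print(good, bad)
```

In this sketch a classic reward model would correspond to calling a reward head on the prompt and response alone, with no critique; the `num_critiques` parameter is the knob that trades extra inference compute for a less noisy reward.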

Critique-out-Loud Reward Models

Self-Generated Critiques Boost Reward Modeling for Language Models

RewardBench: Evaluating Reward Models for Language Modeling

How to Evaluate Reward Models for RLHF

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards

Improving Reward Models with Synthetic Critiques

Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation

Self-Rewarding Language Models

Towards Understanding the Influence of Reward Margin on Preference Model Performance

Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization

Navigating Noisy Feedback: Enhancing Reinforcement Learning with Error-Prone Language Models

Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning

Secrets of RLHF in Large Language Models Part II: Reward Modeling

The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

Fine-Tuning Language Models from Human Preferences

Generative Reward Models

RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Taming Overconfidence in LLMs: Reward Calibration in RLHF