Critique-out-Loud Reward Models

Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D. Chang, Prithviraj Ammanabrolu
2024-08-22
Abstract: Traditionally, reward models used for reinforcement learning from human feedback (RLHF) are trained to directly predict preference scores without leveraging the generation capabilities of the underlying large language model (LLM). This limits the capabilities of reward models as they must reason implicitly about the quality of a response, i.e., preference modeling must be performed in a single forward pass through the model. To enable reward models to reason explicitly about the quality of a response, we introduce Critique-out-Loud (CLoud) reward models. CLoud reward models operate by first generating a natural language critique of the assistant's response that is then used to predict a scalar reward for the quality of the response. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models: compared to classic reward models, CLoud reward models improve pairwise preference classification accuracy on RewardBench by 4.65 and 5.84 percentage points for the 8B and 70B base models respectively. Furthermore, CLoud reward models lead to a Pareto improvement for win rate on ArenaHard when used as the scoring model for Best-of-N. Finally, we explore how to exploit the dynamic inference compute capabilities of CLoud reward models by performing self-consistency decoding for reward prediction.
Machine Learning
What problem does this paper attempt to address?
This paper addresses a limitation of the traditional reward models used in reinforcement learning from human feedback (RLHF): they predict preference scores directly, without leveraging the generation capabilities of the underlying large language model (LLM). As a result, they must reason implicitly about the quality of a response, with preference modeling completed in a single forward pass. To enable explicit reasoning about response quality, the authors introduce the Critique-out-Loud (CLoud) reward model. A CLoud reward model first generates a natural language critique of the assistant's response and then predicts a scalar reward based on that critique. This lets the reward model reason in more detail when evaluating a response, which improves its accuracy and interpretability.

The main contributions of the paper include:
- Proposing the CLoud reward model, which generates a critique of a response before scoring it.
- Verifying experimentally that CLoud reward models outperform classic reward models: pairwise preference classification accuracy on RewardBench improves by 4.65 and 5.84 percentage points for the Llama-3-8B and 70B base models respectively, and Best-of-N win rate on ArenaHard shows a Pareto improvement.
- Exploring how to exploit the dynamic inference compute capabilities of the CLoud reward model by using self-consistency decoding for reward prediction.

By introducing language generation capabilities, the CLoud reward model lays the foundation for unifying the classical reward model and the LLM-as-a-Judge framework, inheriting the advantages of both.
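
The critique-then-score flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `cloud_reward`, `self_consistent_reward`, `best_of_n`, `make_toy_critic`, and `toy_reward` are hypothetical, the critic and reward head are toy stand-ins for what would be LLM calls in practice, and self-consistency is assumed here to mean averaging the predicted reward over several sampled critiques.

```python
from itertools import cycle
from statistics import mean
from typing import Callable, Sequence

# Hypothetical interfaces: in a real system both would be backed by an LLM.
CritiqueFn = Callable[[str, str], str]        # (prompt, response) -> critique text
RewardFn = Callable[[str, str, str], float]   # (prompt, response, critique) -> scalar


def cloud_reward(prompt: str, response: str,
                 critique_fn: CritiqueFn, reward_fn: RewardFn) -> float:
    """Generate a critique first, then score the response conditioned on it."""
    critique = critique_fn(prompt, response)
    return reward_fn(prompt, response, critique)


def self_consistent_reward(prompt: str, response: str,
                           critique_fn: CritiqueFn, reward_fn: RewardFn,
                           num_samples: int = 4) -> float:
    """Assumed form of self-consistency: average the reward over sampled critiques."""
    return mean(cloud_reward(prompt, response, critique_fn, reward_fn)
                for _ in range(num_samples))


def best_of_n(prompt: str, candidates: Sequence[str],
              critique_fn: CritiqueFn, reward_fn: RewardFn,
              num_samples: int = 4) -> str:
    """Best-of-N: return the candidate with the highest averaged reward."""
    return max(candidates,
               key=lambda r: self_consistent_reward(prompt, r, critique_fn,
                                                    reward_fn, num_samples))


def make_toy_critic() -> CritiqueFn:
    """Toy critic (assumption): alternates two canned critiques to mimic sampling."""
    templates = cycle(["The response is accurate.", "The response is vague."])

    def critic(prompt: str, response: str) -> str:
        return next(templates)

    return critic


def toy_reward(prompt: str, response: str, critique: str) -> float:
    """Toy reward head (assumption): word count, minus 1 if the critique says 'vague'."""
    base = float(len(response.split()))
    return base - 1.0 if "vague" in critique else base


if __name__ == "__main__":
    critic = make_toy_critic()
    candidates = ["short", "a longer answer here"]
    print(best_of_n("Explain RLHF.", candidates, critic, toy_reward, num_samples=2))
```

Passing the critic and reward head as plain callables keeps the sketch independent of any particular model library. Raising `num_samples` spends more inference compute per reward estimate, which is the knob that self-consistency decoding exposes.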