Generative Reward Models

Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, Alon Albalak

2024-10-03

Abstract: Reinforcement Learning from Human Feedback (RLHF) has greatly improved the performance of modern Large Language Models (LLMs). The RLHF process is resource-intensive and technically challenging, generally requiring a large collection of human preference labels over model-generated outputs. Reinforcement Learning from AI Feedback (RLAIF) addresses this data collection challenge by leveraging synthetic preferences generated by an LLM. However, recent work has shown that synthetic preference labels may not align well with human preference judgments. To address this, we propose a hybrid approach that unifies RLHF and RLAIF methodologies. We introduce GenRM, an iterative algorithm that trains an LLM on self-generated reasoning traces, leading to synthetic preference labels matching human preference judgments. Empirically, we show that zero-shot LLM-based judgments under-perform compared to Bradley-Terry reward models on in-distribution tasks (between 9-36%). In contrast, GenRM achieves in-distribution accuracy comparable to Bradley-Terry models, while significantly outperforming them on out-of-distribution tasks (between 10-45%). Moreover, GenRM surpasses the performance of using LLMs as judges on both in-distribution (by 9-31%) and out-of-distribution tasks (by 2-6%). Our results show that combining the strengths of RLHF and RLAIF offers a promising approach for improving the quality of synthetic preference labels.

Machine Learning

What problem does this paper attempt to address?

The paper addresses how to generate high-quality preference labels for reinforcement learning by combining the advantages of human feedback (RLHF) and AI feedback (RLAIF). Specifically, the paper points out:

1. **RLHF** methods, while effective, require a large amount of human preference data to train the reward model, which is resource-intensive and technically challenging.
2. **RLAIF** methods reduce the reliance on human data by using large language models (LLMs) to generate synthetic preference labels, but these synthetic labels may not align with actual human preferences.

To overcome these challenges, the paper proposes a hybrid method called **GenRM** (Generative Reward Model), which iteratively trains an LLM on its own generated reasoning traces so that its synthetic preference labels move closer to human preference judgments. The main contributions of the paper include:

- The **GenRM** algorithm generates synthetic preference labels whose accuracy is comparable to traditional Bradley-Terry reward models on in-distribution tasks and significantly better on out-of-distribution tasks (by 10-45%).
- The **CoT-GenRM** variant further improves performance through Chain-of-Thought (CoT) reasoning, giving the model stronger generalization on reasoning tasks.

Overall, the paper proposes a way to generate high-quality synthetic preference labels by combining the advantages of RLHF and RLAIF, thereby improving the performance and generalization of models trained with those labels.
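To make the comparison above concrete, here is a minimal Python sketch of the two ingredients it mentions. The first function is the standard Bradley-Terry preference probability computed from two scalar rewards. The second is one round of collecting self-generated reasoning traces, under the assumption that a trace is kept for training only when its final verdict agrees with the human label; the summary does not state the selection rule, so treat that filter as illustrative. The names `Comparison`, `Trace`, `collect_agreeing_traces`, and `toy_judge` are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


@dataclass
class Comparison:
    prompt: str
    response_a: str
    response_b: str
    human_label: str  # "A" or "B": the response the human annotator preferred


@dataclass
class Trace:
    reasoning: str  # free-text rationale produced by the judge
    verdict: str    # "A" or "B": the judge's final preference


def collect_agreeing_traces(
    data: List[Comparison],
    judge: Callable[[Comparison], Trace],
    samples_per_item: int = 4,
) -> List[Tuple[Comparison, Trace]]:
    """Sample judge traces and keep those whose verdict matches the human label.

    Assumption (not stated in the summary): agreement with the human label
    is the criterion for keeping a trace as fine-tuning data.
    """
    kept = []
    for item in data:
        for _ in range(samples_per_item):
            trace = judge(item)
            if trace.verdict == item.human_label:
                kept.append((item, trace))
    return kept


def toy_judge(item: Comparison) -> Trace:
    """Stand-in for an LLM judge: always prefers the longer response."""
    verdict = "A" if len(item.response_a) >= len(item.response_b) else "B"
    return Trace(reasoning="picked the longer response", verdict=verdict)


data = [
    Comparison("q1", "short", "a longer answer", human_label="B"),
    Comparison("q2", "detailed reply", "no", human_label="B"),
]
kept = collect_agreeing_traces(data, toy_judge, samples_per_item=2)
```

In a real pipeline, `toy_judge` would be replaced by sampled LLM generations, and the kept pairs would serve as fine-tuning data for the next iteration of the judge.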
