A Long Way to Go: Investigating Length Correlations in RLHF

Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett

2024-07-11

Abstract: Great success has been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models, with open preference datasets enabling wider experimentation, particularly for "helpfulness" in tasks like dialogue and web question answering. Alongside these improvements, however, RLHF also often drives models to produce longer outputs. This paper demonstrates, on three diverse settings, that optimizing for response length is, much more than previously thought, a significant factor behind RLHF. Studying the strategies RL optimization uses to maximize reward, we find improvements in reward to largely be driven by increasing response length, instead of other features. Indeed, we find that even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models. Testing a comprehensive set of length-countering interventions, we identify the dominant source of these biases to be reward models, which, by studying training dynamics, we find are non-robust and easily influenced by length biases in preference data.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper examines a side effect of training large language models with reinforcement learning from human feedback (RLHF): the resulting models tend to generate longer outputs. Although RLHF improves measured "helpfulness" in tasks such as dialogue and question answering, the authors find that gains in reward are driven primarily by response length rather than by other features. They show this in three settings (WebGPT, Stack, and RLCD) and argue that reward models are non-robust and easily pick up length biases in the preference data instead of capturing the underlying human preferences. In their experiments, even a reward based solely on length reproduces most of the downstream improvement of RLHF over supervised fine-tuned models. The authors also test a range of length-countering interventions. No single strategy works in every setting, but in some cases an intervention brings output length close to that of the base model without sacrificing performance. The findings suggest that current reward models capture only superficial aspects of human preferences, cast doubt on how much of the reported "improvement" from PPO reflects real quality gains, and point to the need for better preference data and downstream evaluation.
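To make the two length-related ideas in this summary concrete, here is a minimal Python sketch. It is not the paper's code: the whitespace tokenizer, the saturating form of the reward, the `cap` parameter, the function names, and the toy preference pairs are all assumptions chosen for illustration. The first function estimates how often the preferred response in a preference pair is simply the longer one, which is one rough way to probe a dataset for length bias. The second shows what a reward that depends only on length could look like.

```python
from typing import Iterable, Tuple


def token_length(text: str) -> int:
    """Crude length proxy: whitespace-separated token count (an assumption)."""
    return len(text.split())


def length_heuristic_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (preferred, rejected) pairs whose preferred response is strictly longer."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    longer = sum(token_length(p) > token_length(r) for p, r in pairs)
    return longer / len(pairs)


def length_only_reward(response: str, cap: int = 256) -> float:
    """Hypothetical reward that depends only on length, saturating at `cap` tokens."""
    return min(token_length(response), cap) / cap


# Toy, made-up (preferred, rejected) pairs; not drawn from any real dataset.
toy_pairs = [
    ("Use a context manager so the file is always closed.", "Close it."),
    ("Yes.", "It depends on many factors that are hard to list."),
    ("Call sorted(items) to get a new sorted list.", "Sort it."),
    ("Restart the service and check the logs.", "Restart."),
]

accuracy = length_heuristic_accuracy(toy_pairs)
reward_short = length_only_reward("Sort it.")
reward_long = length_only_reward("Call sorted(items) to get a new sorted list.")
```

On a real preference dataset, a `length_heuristic_accuracy` well above 0.5 would indicate that a reward model could score well on held-out pairs by learning length alone. A real study would use the model's own tokenizer rather than whitespace splitting.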
