A Long Way to Go: Investigating Length Correlations in RLHF

Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett
2024-07-11
Abstract: Great success has been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models, with open preference datasets enabling wider experimentation, particularly for "helpfulness" in tasks like dialogue and web question answering. Alongside these improvements, however, RLHF also often drives models to produce longer outputs. This paper demonstrates, on three diverse settings, that optimizing for response length is, much more than previously thought, a significant factor behind RLHF. Studying the strategies RL optimization uses to maximize reward, we find improvements in reward to largely be driven by increasing response length, instead of other features. Indeed, we find that even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models. Testing a comprehensive set of length-countering interventions, we identify the dominant source of these biases to be reward models, which, by studying training dynamics, we find are non-robust and easily influenced by length biases in preference data.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper examines a side effect of training large language models with reinforcement learning from human feedback (RLHF): the trained models tend to produce longer outputs. RLHF is reported to improve "helpfulness" in tasks such as dialogue and question answering, but the authors find that the gains in reward are driven primarily by increased response length rather than by other features. They show this in three settings (WebGPT, Stack, and RLCD) and argue that the reward models are non-robust and easily influenced by length biases in the preference data, so they do not reliably capture true human preferences. In their experiments, even a reward based solely on length reproduces most of the downstream improvements of RLHF over supervised fine-tuned models. The authors also test a range of interventions against length bias. No single intervention is effective in every setting, but in some cases an intervention brings output length back close to that of the base model without sacrificing performance. Taken together, the findings suggest that current reward models capture only superficial aspects of human preference, call the reported "improvements" from PPO into question, and point to the need for better preference data and better downstream evaluation.
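To make the two central ideas concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes whitespace-separated words as a stand-in for tokens, and the function names `length_reward` and `longer_preferred_rate`, the `max_tokens` cap, and the toy preference pairs are all hypothetical choices made for illustration. The first function is a reward that depends only on response length; the second measures how often the preferred response in a preference pair is the longer one, which is one simple way to check a preference dataset for length bias.

```python
from typing import List, Tuple


def length_reward(response: str, max_tokens: int = 256) -> float:
    """Reward that depends only on length (assumed form, not the paper's).

    Counts whitespace-separated words and scales linearly up to a cap,
    so the reward lies in [0.0, 1.0].
    """
    n_tokens = len(response.split())
    return min(n_tokens, max_tokens) / max_tokens


def longer_preferred_rate(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (chosen, rejected) pairs where the chosen text is longer.

    A value well above 0.5 suggests that a reward model trained on these
    pairs could pick up length as a shortcut feature.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    longer = sum(len(chosen.split()) > len(rejected.split())
                 for chosen, rejected in pairs)
    return longer / len(pairs)


# Toy, made-up preference pairs (chosen, rejected); not data from the paper.
toy_pairs = [
    ("a b c", "a"),
    ("a", "a b"),
    ("a b", "a"),
    ("a b c d", "a b"),
]

if __name__ == "__main__":
    print(length_reward("a b c d", max_tokens=8))
    print(longer_preferred_rate(toy_pairs))
```

A real check would use the tokenizer of the model being trained rather than whitespace splitting, and would compare the measured rate against the 0.5 expected if length were unrelated to preference.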