Abstract: Reward models have become increasingly critical for improving the reasoning capability of LLMs. Existing research has shown that a well-trained reward model can substantially improve model performance at inference time via search. However, the potential of reward models during RL training remains largely under-explored. It is currently unclear whether these reward models can provide additional training signals to enhance the reasoning capabilities of LLMs in RL training that uses sparse success rewards, which verify the correctness of solutions. In this work, we evaluate popular reward models for RL training, including the Outcome-supervised Reward Model (ORM) and the Process-supervised Reward Model (PRM), and train a collection of LLMs on math problems using RL by combining these learned rewards with success rewards. Surprisingly, even though these learned reward models have strong inference-time performance, they may not help, and may even hurt, RL training, producing worse performance than LLMs trained with the success reward only. Our analysis reveals that an LLM can receive high rewards from some of these reward models by repeating correct but unnecessary reasoning steps, leading to severe reward hacking. Therefore, we introduce two novel reward refinement techniques, Clipping and Delta. The key idea is to ensure that the cumulative reward of any reasoning trajectory is upper-bounded, so that a learned reward model remains effective without being exploited. We evaluate our techniques with multiple reward models over a set of 1.5B and 7B LLMs on the MATH and GSM8K benchmarks and demonstrate that, with a carefully designed reward function, RL training without any additional supervised tuning can improve all the evaluated LLMs, including the state-of-the-art 7B LLM Qwen2.5-Math-7B-Instruct.
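
The abstract states only that Clipping and Delta keep the cumulative reward of a trajectory upper-bounded; it does not give the formulas. The sketch below is one plausible reading, not the authors' exact formulation: it assumes per-step PRM scores are available as a list of floats, that clipping subtracts a threshold and caps each step reward at zero, and that Delta replaces each step reward with its difference from the next step's reward. The names `clip_rewards`, `delta_rewards`, and `threshold` are hypothetical.

```python
from typing import List


def clip_rewards(step_rewards: List[float], threshold: float) -> List[float]:
    """Assumed Clipping variant: shift each step reward by `threshold`
    and cap it at zero, so no step contributes a positive reward and
    the trajectory sum can never exceed 0."""
    return [min(r - threshold, 0.0) for r in step_rewards]


def delta_rewards(step_rewards: List[float]) -> List[float]:
    """Assumed Delta variant: each step gets the difference between its
    reward and the next step's reward; the last step keeps its own reward.
    The sum telescopes to the first step's reward, regardless of length."""
    if not step_rewards:
        return []
    deltas = [
        step_rewards[k] - step_rewards[k + 1]
        for k in range(len(step_rewards) - 1)
    ]
    deltas.append(step_rewards[-1])
    return deltas


if __name__ == "__main__":
    # A trajectory that repeats the same well-scored step.
    short = [0.75] * 4
    padded = [0.75] * 8  # same step repeated twice as often

    # Raw per-step rewards: repetition doubles the total (the hacking incentive).
    print(sum(short), sum(padded))

    # Clipped and delta rewards: the total no longer grows with repetition.
    print(sum(clip_rewards(short, 0.5)), sum(clip_rewards(padded, 0.5)))
    print(sum(delta_rewards(short)), sum(delta_rewards(padded)))
```

Under these assumptions, padding a solution with extra correct-looking steps increases the raw reward sum but leaves the clipped and delta sums unchanged, which is the bounded-cumulative-reward property the abstract describes.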

Rethinking Information Structures in RLHF: Reward Generalization from a Graph Theory Perspective

Reward Generalization in RLHF: A Topological Perspective

Secrets of RLHF in Large Language Models Part II: Reward Modeling

InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

Prototypical Reward Network for Data-Efficient RLHF

Reward-Robust RLHF in LLMs

Mitigating Reward Hacking Via Information-Theoretic Reward Modeling

Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs

Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment

Optimizing Language Models with Fair and Stable Reward Composition in Reinforcement Learning

Optimal Design for Reward Modeling in RLHF

Stabilizing RLHF through Advantage Model and Selective Rehearsal

Confronting Reward Model Overoptimization with Constrained RLHF

R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

On Designing Effective RL Reward at Training Time for LLM Reasoning

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs