Abstract: Language model calibration refers to the alignment between a model's confidence and the actual performance of its responses. Previous studies have pointed out the overconfidence phenomenon in Large Language Models (LLMs) and shown that LLMs trained with Reinforcement Learning from Human Feedback (RLHF) are overconfident, with a more sharpened output probability. In this study, we reveal that RLHF tends to lead models to express verbalized overconfidence in their own responses. We investigate the underlying cause of this overconfidence and demonstrate that reward models used for Proximal Policy Optimization (PPO) exhibit inherent biases towards high-confidence scores regardless of the actual quality of responses. Building upon this insight, we propose two PPO variants: PPO-M (PPO with Calibrated Reward Modeling) and PPO-C (PPO with Calibrated Reward Calculation). PPO-M integrates explicit confidence scores into reward model training, which calibrates reward models to better capture the alignment between response quality and verbalized confidence. PPO-C adjusts the reward score during PPO based on the difference between the current reward and the moving average of past rewards. Both PPO-M and PPO-C can be seamlessly integrated into the current PPO pipeline and do not require additional golden labels. We evaluate our methods on both Llama3-8B and Mistral-7B across six diverse datasets, covering multiple-choice and open-ended generation. Experimental results demonstrate that both methods reduce calibration error while maintaining performance comparable to standard PPO. We further show that they do not compromise model capabilities in open-ended conversation settings.
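
As a hedged illustration of what "calibration error" can mean for verbalized confidence, the sketch below computes the expected calibration error (ECE), a common binned metric that compares average stated confidence with empirical accuracy. The abstract does not say which metric or binning the authors use, so the function name `expected_calibration_error`, the choice of 10 equal-width bins, and the toy inputs are all assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean of |avg confidence - accuracy| per bin.

    Assumption: confidences are already normalized to [0, 1]
    (e.g. a verbalized score of 8/10 becomes 0.8).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    if not confidences:
        return 0.0

    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(a for _, a in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Toy example: the model always states 0.9 confidence but is right half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
```

In this toy case the stated confidence (0.9) exceeds the accuracy (0.5), which is the kind of gap the abstract describes as verbalized overconfidence; a lower value indicates confidence that tracks correctness more closely.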

Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards

Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning

RRHF: Rank Responses to Align Language Models with Human Feedback

R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

Taming Overconfidence in LLMs: Reward Calibration in RLHF

How to Evaluate Reward Models for RLHF

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

Fine-Tuning Language Models with Reward Learning on Policy

LongReward: Improving Long-context Large Language Models with AI Feedback

Prototypical Reward Network for Data-Efficient RLHF

Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization

Fine-Tuning Language Models from Human Preferences

Optimizing Language Models with Fair and Stable Reward Composition in Reinforcement Learning

Confronting Reward Model Overoptimization with Constrained RLHF

The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

LIRE: listwise reward enhancement for preference alignment

Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment

Fine-Tuning Language Models with Advantage-Induced Policy Alignment