Abstract: The Bradley-Terry (BT) model is a common and successful practice in reward modeling for Large Language Model (LLM) alignment. However, it remains unclear why this model, originally developed for multi-player stochastic game matching, can be adopted to convert pairwise response comparisons to reward values and make predictions, especially given that only a limited number of prompt-response pairs are sparsely compared with others. In this paper, we first revisit the foundations of using BT models in reward modeling and establish the convergence rate of BT reward models based on deep neural networks using embeddings, providing a theoretical foundation for their use. Although the BT model is theoretically sound, we argue that it is not a necessary choice from the perspective of downstream optimization, because a reward model only needs to preserve the correct ranking predictions through a monotonic transformation of the true reward. We highlight the critical concept of order consistency in reward modeling and demonstrate that the BT model possesses this property. Consequently, we propose a simple and straightforward upper-bound algorithm, compatible with off-the-shelf binary classifiers, as an alternative order-consistent reward modeling objective. To offer practical insights, we empirically evaluate the performance of these different reward modeling approaches across more than 12,000 experimental setups, using $6$ base LLMs, $2$ datasets, and diverse annotation designs that vary in quantity, quality, and pairing choices in preference annotations.
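For reference, the two ideas the abstract relies on can be written out as follows. The notation is an assumption introduced here, not taken from the paper: $r(x, y)$ is the reward of response $y$ to prompt $x$, $\sigma$ is the logistic function, and $h$ is any strictly increasing function. The first line is the standard BT win probability; the second states why a monotonic transformation of the reward leaves rankings unchanged, which is the sense of order consistency used above.

$$
P(y_1 \succ y_2 \mid x) = \frac{\exp r(x, y_1)}{\exp r(x, y_1) + \exp r(x, y_2)} = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr)
$$

$$
h\bigl(r(x, y_1)\bigr) > h\bigl(r(x, y_2)\bigr) \iff r(x, y_1) > r(x, y_2)
$$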


### What problems does this paper attempt to solve?

This paper mainly explores the theoretical basis, effectiveness, and alternatives of using the Bradley-Terry (BT) model for preference reward modeling in large language model (LLM) alignment. Specifically, the paper aims to answer the following key questions:

1. **Is the application of the BT model theoretically sound when the number of players is greater than the number of comparisons?**
   - The paper analyzes the theoretical basis for applying the BT model in LLM alignment and explores its practical performance under sparse comparisons.
2. **Are there other feasible reward-modeling methods besides the BT model?**
   - The paper proposes a simple classification algorithm based on order consistency as an alternative and demonstrates the effectiveness of these methods.
3. **The traditional BT model assumes randomized pairwise comparisons. Can cross-prompt comparisons be more effectively used for reward modeling?**
   - The paper proves the superiority of cross-prompt annotations in LLM alignment through theoretical analysis and experiments.

### Main contributions

1. **Comprehensive analysis**: Provides a comprehensive analysis of the application of the BT model in LLM alignment, comparing its traditional use in multi-player arenas with the unique challenges in LLM alignment.
2. **Theoretical basis**: Introduces the first asymptotic theory of BT regression on neural networks in preference reward modeling and establishes the first risk bound for BT-model reward estimation in LLM alignment.
3. **Practical suggestions**: Proposes and validates order consistency as the core objective of reward modeling and shows how to derive the BT model and an alternative classification method from this principle.
4. **Empirical research**: Conducts extensive experiments, covering 6 base LLMs, 2 datasets, 3 response sampling methods, 6 annotation noise levels, 3 reward-model implementations, 4 annotation-availability scenarios, and 5 random seeds, with more than 12,000 runs in total, proving the statistical validity of the classification-based reward model and comparing it with the BT model.

### Conclusion

By rethinking the application of the BT model in LLM alignment, the paper not only provides profound theoretical insights but also offers new ideas and tools for practical applications, especially in dealing with sparse comparisons and exploring different types of comparisons (such as cross-prompt comparisons).
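The contrast between a BT objective and a classification-style objective can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a linear reward on synthetic Gaussian "embeddings", a hypothetical ground-truth weight vector `w_true`, and one plausible reading of the classification alternative in which winners are labelled 1 and losers 0 independently. All function and variable names are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(grad_fn, dim, lr=0.5, steps=500):
    """Plain gradient descent from zero on a convex loss."""
    w = np.zeros(dim)
    for _ in range(steps):
        w -= lr * grad_fn(w)
    return w

def bt_grad(w, x_win, x_lose):
    """Gradient of the BT negative log-likelihood, mean of log(1 + exp(-(r_win - r_lose)))."""
    diff = x_win - x_lose
    return -(sigmoid(-(diff @ w))[:, None] * diff).mean(axis=0)

def clf_grad(w, x_win, x_lose):
    """Gradient of binary cross-entropy with winners labelled 1 and losers labelled 0."""
    g_win = -(sigmoid(-(x_win @ w))[:, None] * x_win).mean(axis=0)
    g_lose = (sigmoid(x_lose @ w)[:, None] * x_lose).mean(axis=0)
    return 0.5 * (g_win + g_lose)

def ranking_accuracy(w, w_ref, x_a, x_b):
    """Fraction of pairs that w orders the same way as the reference reward."""
    diff = x_a - x_b
    return float(np.mean(np.sign(diff @ w) == np.sign(diff @ w_ref)))

rng = np.random.default_rng(0)
dim, n_pairs = 5, 2000
w_true = np.array([1.5, -1.0, 0.5, 0.0, 2.0])  # hypothetical true reward direction

# Sample pairs and noisy preference labels from the BT model (assumption of this sketch).
x_a = rng.normal(size=(n_pairs, dim))
x_b = rng.normal(size=(n_pairs, dim))
a_wins = rng.random(n_pairs) < sigmoid((x_a - x_b) @ w_true)
x_win = np.where(a_wins[:, None], x_a, x_b)
x_lose = np.where(a_wins[:, None], x_b, x_a)

w_bt = fit(lambda w: bt_grad(w, x_win, x_lose), dim)
w_clf = fit(lambda w: clf_grad(w, x_win, x_lose), dim)

# Evaluate on fresh pairs against the noise-free ordering.
t_a = rng.normal(size=(5000, dim))
t_b = rng.normal(size=(5000, dim))
acc_bt = ranking_accuracy(w_bt, w_true, t_a, t_b)
acc_clf = ranking_accuracy(w_clf, w_true, t_a, t_b)
print(f"BT ranking accuracy:         {acc_bt:.3f}")
print(f"Classifier ranking accuracy: {acc_clf:.3f}")
```

Both fitted rewards are scored only by whether they order held-out pairs correctly, not by how close their values are to the true reward. That mirrors the order-consistency argument: for downstream use, a reward needs to agree with the true reward only up to a monotonic transformation.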

Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations, Theory, and Alternatives
