Abstract: Learning a reward model (RM) from human preferences has been an important component in aligning large language models (LLMs). The canonical setup of learning RMs from pairwise preference data is rooted in the classic Bradley-Terry (BT) model, which accepts only binary feedback, i.e., the label says either that Response 1 is better than Response 2, or the opposite. Such a setup inevitably discards potentially useful samples (such as those where the two responses are "tied") and loses more fine-grained information (such as "slightly better"). In this paper, we propose a framework for learning RMs under ordinal feedback, which generalizes binary preference feedback to arbitrary granularity. Specifically, we first identify a marginal unbiasedness condition, which generalizes the assumption of the BT model in the existing binary feedback setting. The condition is justified by the sociological concept of the wisdom of the crowd. Under the condition, we develop a natural probability model for pairwise preference data under ordinal feedback and analyze its properties. We prove the statistical benefits of ordinal feedback in terms of reducing the Rademacher complexity compared to the case of binary feedback. The proposed learning objective and the theory also extend to hinge loss and direct preference optimization (DPO). In particular, the theoretical analysis may be of independent interest when applied to the seemingly unrelated problem of knowledge distillation, where it helps interpret the bias-variance trade-off. The framework also sheds light on writing guidance for human annotators. Our numerical experiments validate that fine-grained feedback leads to better reward learning in both in-distribution and out-of-distribution settings. Further experiments show that incorporating a certain proportion of samples with tied preference boosts RM learning.
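To make the step from binary to ordinal feedback concrete, the following is a minimal sketch, not the paper's stated objective. It assumes the ordinal label is encoded as a number `z` in [0, 1] (for example 1.0 for "better", 0.75 for "slightly better", 0.5 for "tied") and that the loss is the BT cross-entropy with `z` used as a soft label. The function names and the label encoding are illustrative assumptions; the abstract does not specify them.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def ordinal_bt_loss(r1: float, r2: float, z: float) -> float:
    """Cross-entropy under a Bradley-Terry model with a soft label.

    r1, r2: scalar rewards assigned to Response 1 and Response 2.
    z: assumed encoding of ordinal feedback in [0, 1], read as the
       annotator's degree of preference for Response 1 (hypothetical).
    With z in {0, 1} this reduces to the usual binary BT loss.
    """
    if not 0.0 <= z <= 1.0:
        raise ValueError("z must lie in [0, 1]")
    d = r1 - r2
    return -(z * log_sigmoid(d) + (1.0 - z) * log_sigmoid(-d))


# Binary feedback: Response 1 preferred, standard BT negative log-likelihood.
binary_loss = ordinal_bt_loss(2.0, 0.0, 1.0)

# Tied feedback: the loss is smallest when the two rewards are equal.
tie_equal = ordinal_bt_loss(0.0, 0.0, 0.5)
tie_apart = ordinal_bt_loss(1.0, 0.0, 0.5)
```

Under this encoding a tied sample is not discarded: it pulls the two rewards toward each other, whereas a "slightly better" label pulls the reward gap toward a small positive value rather than pushing it to grow without bound.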

Apple Tasting: Combinatorial Dimensions and Minimax Rates

Deterministic Apple Tasting

Apple Tasting Revisited: Bayesian Approaches to Partially Monitored Online Binary Classification

Online Learning with Set-Valued Feedback

Combinatorial Bandits with Relative Feedback

Bounds on the price of feedback for mistake-bounded online learning

Online Learning with Feedback Graphs: Beyond Bandits

Online Ranking with Top-1 Feedback

Bandit-Feedback Online Multiclass Classification: Variants and Tradeoffs

Bandits with Preference Feedback: A Stackelberg Game Perspective

Noise-Tolerant Interactive Learning from Pairwise Comparisons

Faster Convergence with Multiway Preferences

A Combinatorial Characterization of Supervised Online Learnability

Online Learning from Strategic Human Feedback in LLM Fine-Tuning

Practical, Provably-Correct Interactive Learning in the Realizable Setting: The Power of True Believers

Sequential Probability Assignment with Contexts: Minimax Regret, Contextual Shtarkov Sums, and Contextual Normalized Maximum Likelihood

Multiclass Online Learnability under Bandit Feedback

Reward Modeling with Ordinal Feedback: Wisdom of the Crowd

Optimal Learners for Realizable Regression: PAC Learning and Online Learning

Online Learning with Feedback Graphs: The True Shape of Regret

The Real Price of Bandit Information in Multiclass Classification