Abstract: Equipping agents with the capacity to justify their decisions using supporting evidence is a cornerstone of accountable decision-making. Ensuring that these justifications are in line with human expectations and societal norms is equally vital, especially in high-stakes settings such as healthcare. In this work, we propose a debate-based reward model for reinforcement learning agents, in which the outcome of a zero-sum debate game quantifies the justifiability of a decision in a particular state. This reward model is then used to train a justifiable policy, whose decisions can be more easily corroborated with supporting evidence. In the debate game, two argumentative agents take turns providing supporting evidence for two competing decisions. Given the proposed evidence, a proxy of a human judge evaluates which decision is better justified. We demonstrate the potential of our approach by learning policies for prescribing and justifying treatment decisions for septic patients. We show that augmenting the reward with the feedback signal generated by the debate-based reward model yields policies that the judge strongly favors over the policy obtained solely from environment rewards, while hardly sacrificing any performance. Moreover, in terms of the overall performance and justifiability of the trained policies, the debate-based feedback is comparable to the feedback obtained from an ideal judge proxy that evaluates decisions using the full information encoded in the state. This suggests that the debate game surfaces the information in a state that is most relevant for evaluating decisions, which in turn supports the practicality of combining our approach with human-in-the-loop evaluations. Lastly, we show that agents trained via multi-agent debate learn to propose evidence that is resilient to refutation and closely aligns with human preferences.
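As a rough illustration of the debate game and reward augmentation described in the abstract, the following Python sketch lets two debaters alternate in revealing state features to a judge, then mixes the judge's verdict into the environment reward. It is a toy under stated assumptions, not the paper's implementation: all names (`run_debate`, `toy_judge`, `greedy_debater`, `augmented_reward`), the feature names, the judge weights, and the mixing coefficient `lam` are hypothetical. The linear judge and the greedy debaters stand in for the learned judge proxy and the trained argumentative agents, and the convex combination of rewards is one plausible way to "augment" the reward.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]            # hypothetical encoding: feature name -> value
Evidence = List[Tuple[str, float]]  # features revealed to the judge so far

# Hypothetical judge weights: how strongly each feature supports each action.
WEIGHTS = {
    "treat": {"lactate": 1.0, "heart_rate": 0.5, "age": 0.0},
    "wait": {"lactate": -1.0, "heart_rate": -0.5, "age": 0.2},
}


def toy_judge(evidence: Evidence, action_a: str, action_b: str) -> float:
    """Score action_a against action_b in [-1, 1] using only revealed evidence.

    Swapping the two actions negates the score, which makes the game zero-sum.
    """
    def support(action: str) -> float:
        return sum(WEIGHTS[action][name] * value for name, value in evidence)

    diff = support(action_a) - support(action_b)
    return max(-1.0, min(1.0, diff))


def greedy_debater(state: State, remaining: List[str], revealed: Evidence,
                   defended: str, opposed: str) -> str:
    """Pick the unrevealed feature that most improves the judge's current verdict
    for the defended action (a stand-in for a trained argumentative agent)."""
    return max(
        remaining,
        key=lambda f: toy_judge(revealed + [(f, state[f])], defended, opposed),
    )


def run_debate(state: State, action_a: str, action_b: str,
               pick_evidence: Callable[..., str],
               judge: Callable[[Evidence, str, str], float],
               num_turns: int = 2) -> float:
    """Debaters alternate turns; the first defends action_a, the second action_b.

    Returns the judge's final verdict for action_a given the revealed evidence.
    """
    revealed: Evidence = []
    remaining = sorted(state)
    for turn in range(num_turns):
        if not remaining:
            break
        defended, opposed = (action_a, action_b) if turn % 2 == 0 else (action_b, action_a)
        feature = pick_evidence(state, remaining, revealed, defended, opposed)
        remaining.remove(feature)
        revealed.append((feature, state[feature]))
    return judge(revealed, action_a, action_b)


def augmented_reward(env_reward: float, debate_outcome: float, lam: float = 0.5) -> float:
    """Assumed mixing rule: convex combination of environment and debate rewards."""
    return (1.0 - lam) * env_reward + lam * debate_outcome


if __name__ == "__main__":
    patient = {"lactate": 0.8, "heart_rate": 0.4, "age": 0.5}
    outcome = run_debate(patient, "treat", "wait", greedy_debater, toy_judge)
    print("debate outcome for 'treat':", outcome)
    print("augmented reward:", augmented_reward(0.2, outcome, lam=0.5))
```

In this sketch the judge sees only the features the debaters choose to reveal, mirroring the idea that a short debate can surface the evidence most relevant to a decision without exposing the full state. Setting `lam` to 0 recovers the pure environment reward, while larger values weight justifiability more heavily.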

Argumentative Reward Learning: Reasoning About Human Preferences

Models of human preference for learning reward functions

Pragmatic Feature Preferences: Learning Reward-Relevant Preferences from Human Input

Reward Uncertainty for Exploration in Preference-based Reinforcement Learning

Reward Design for Justifiable Sequential Decision-Making

Data-Centric Human Preference Optimization with Rationales

Learning Optimal Advantage from Preferences and Mistaking it for Reward

PRefLexOR: Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning and Agentic Thinking

RewardBench: Evaluating Reward Models for Language Modeling

Fine-Tuning Language Models from Human Preferences

Learning Optimal Behavior Through Reasoning and Experiences

The History and Risks of Reinforcement Learning and Human Feedback

Reward Modeling with Ordinal Feedback: Wisdom of the Crowd

Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Human irrationality: both bad and good for reward inference

Can Differentiable Decision Trees Enable Interpretable Reward Learning from Human Feedback?

The Effect of Modeling Human Rationality Level on Learning Rewards from Multiple Feedback Types

Neuro-Symbolic Forward Reasoning

RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation

Learning to refer informatively by amortizing pragmatic reasoning