Abstract: Preference-based reinforcement learning (RL) provides a framework to train agents using human preferences between two behaviors. However, preference-based RL has been challenging to scale since it requires a large amount of human feedback to learn a reward function aligned with human intent. In this paper, we present Preference Transformer, a neural architecture that models human preferences using transformers. Unlike prior approaches assuming human judgment is based on the Markovian rewards which contribute to the decision equally, we introduce a new preference model based on the weighted sum of non-Markovian rewards. We then design the proposed preference model using a transformer architecture that stacks causal and bidirectional self-attention layers. We demonstrate that Preference Transformer can solve a variety of control tasks using real human preferences, while prior approaches fail to work. We also show that Preference Transformer can induce a well-specified reward and attend to critical events in the trajectory by automatically capturing the temporal dependencies in human decision-making. Code is available on the project website: https://sites.google.com/view/preference-transformer
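
The contrast the abstract draws between the two preference models can be written out as a short sketch. The notation here is an assumption chosen for illustration and does not appear in the text above: $\sigma^0$ and $\sigma^1$ are two trajectory segments of length $H$, $\hat{r}$ is a learned reward, and $w$ is a learned importance weight.

```latex
% Prior approaches: Markovian rewards, each time step weighted equally
P[\sigma^1 \succ \sigma^0]
  = \frac{\exp\left(\sum_{t=1}^{H} \hat{r}(s_t^1, a_t^1)\right)}
         {\sum_{j \in \{0,1\}} \exp\left(\sum_{t=1}^{H} \hat{r}(s_t^j, a_t^j)\right)}

% Weighted sum of non-Markovian rewards: the reward at step t may depend on
% the history up to t, and the weight w_t^j on the whole segment
P[\sigma^1 \succ \sigma^0]
  = \frac{\exp\left(\sum_{t=1}^{H} w_t^1 \, \hat{r}_t^1\right)}
         {\sum_{j \in \{0,1\}} \exp\left(\sum_{t=1}^{H} w_t^j \, \hat{r}_t^j\right)},
\qquad
\hat{r}_t^j = \hat{r}\left(\{(s_i^j, a_i^j)\}_{i=1}^{t}\right)
```

Setting every $w_t^j = 1$ and letting $\hat{r}_t^j$ depend only on $(s_t^j, a_t^j)$ recovers the first form, so the weighted model can be read as a generalization of the equal-weight Markovian one.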

What problem does this paper attempt to address?

The paper aims to address the issues present in human preference-based learning methods in Reinforcement Learning (RL), particularly the challenges faced when learning reward functions from human-provided preference information. Specifically, the paper proposes a new architecture called the "Preference Transformer" to overcome the limitations of existing methods. These limitations mainly include:

1. **High demand for human feedback**: Existing preference learning methods often require a large amount of human feedback to align the reward function with human intentions.
2. **Limitations of the Markov assumption**: Most existing methods assume that the reward function is Markovian (i.e., dependent only on the current state and action), which does not hold in many real-world tasks.
3. **Issues with the equal weighting assumption**: Existing methods usually assume that humans evaluate the quality of trajectories by giving equal weight to the rewards at each time step, which may not be realistic.

To overcome these issues, the paper contributes the following:

- Proposes a new preference modeling framework based on a weighted sum of non-Markovian rewards, capable of capturing the temporal dependencies in human decision-making and identifying key events in trajectories.
- Designs a Transformer-based architecture, the Preference Transformer, that can handle non-Markovian rewards and generate rewards and their importance weights for each time step.
- Demonstrates that the Preference Transformer can solve problems using real human preference data in various control tasks and can derive well-defined reward functions while automatically capturing important events in trajectories.
- Validates the effectiveness and superiority of the Preference Transformer through experiments comparing it with different baseline models, especially in complex navigation, walking, and robotic manipulation tasks.

In summary, the goal of this research is to improve the efficiency and practicality of human preference-based reinforcement learning methods by proposing the Preference Transformer, particularly in handling non-Markovian rewards and complex tasks.
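
To make the equal-weighting issue concrete, here is a minimal Python sketch of how importance weights can change which segment a preference model favors. It is not the paper's implementation: the function names, the per-step rewards, and the importance logits are all hypothetical values chosen by hand, whereas in the Preference Transformer the rewards and weights would be produced by the network.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def segment_score(rewards, weights=None):
    """Weighted sum of per-step rewards for one trajectory segment.

    With weights=None every step counts equally, which corresponds to the
    equal-weighting assumption described above.
    """
    if weights is None:
        weights = [1.0] * len(rewards)
    if len(weights) != len(rewards):
        raise ValueError("rewards and weights must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


def preference_probability(score_a, score_b):
    """Probability that segment A is preferred over B (softmax of the scores)."""
    return softmax([score_a, score_b])[0]


# Hypothetical per-step rewards: A has one critical event at the end,
# B collects a small reward at every step.
rewards_a = [0.0, 0.0, 1.0]
rewards_b = [0.4, 0.4, 0.4]

# Equal weighting: every time step contributes the same amount.
p_equal = preference_probability(segment_score(rewards_a),
                                 segment_score(rewards_b))

# Hand-picked importance logits (an assumption): A's last step matters most,
# while B's steps are all equally important.
weights_a = softmax([0.0, 0.0, 2.0])
weights_b = softmax([0.0, 0.0, 0.0])
p_weighted = preference_probability(segment_score(rewards_a, weights_a),
                                    segment_score(rewards_b, weights_b))

print(f"P(A preferred), equal weights:      {p_equal:.3f}")
print(f"P(A preferred), importance weights: {p_weighted:.3f}")
```

With these made-up numbers, the equal-weight model leans toward segment B, while the weighted model leans toward segment A because the critical final step dominates its score. This is the kind of behavior the weighted sum is meant to allow.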

Preference Transformer: Modeling Human Preferences using Transformers for RL
