Abstract: Inverse reinforcement learning (IRL) aims to explicitly infer an underlying reward function from collected expert demonstrations. Because expert demonstrations can be costly to obtain, current IRL techniques focus on learning a better-than-demonstrator policy using a reward function derived from sub-optimal demonstrations. However, existing IRL algorithms primarily address trajectory ranking ambiguity when learning the reward function. They overlook the degree of difference between trajectories in terms of their returns, which is essential for further reducing reward ambiguity. In addition, the reward of a single transition is heavily influenced by the context within the trajectory. To address these issues, we introduce the Distance-rank Aware Sequential Reward Learning (DRASRL) framework. Unlike existing approaches, DRASRL uses both the ranking of trajectories and the degree of dissimilarity between them to jointly reduce reward ambiguity while learning a sequence of contextually informed reward signals. Specifically, we use the distance between the policies that generated the trajectories as a measure of the degree of difference between trajectories. This distance-aware information is then used, via contrastive learning, to learn embeddings in the representation space for reward learning. At the same time, we apply a pairwise ranking loss to incorporate ranking information into the latent features. Finally, we use the Transformer architecture to capture contextual dependencies within trajectories in the latent space, leading to more accurate reward estimation. In extensive experiments, DRASRL demonstrates significant performance improvements over previous state-of-the-art methods.
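The two learning signals named in the abstract, ranking and distance, can be illustrated with a minimal sketch. The abstract does not give the exact loss functions, so everything below is an assumption for illustration: `pairwise_ranking_loss` is a generic Bradley-Terry style ranking loss over predicted trajectory returns, and `distance_alignment_loss` is a hypothetical stand-in (a simple squared-error term, not the paper's contrastive objective) that pulls the distance between trajectory embeddings toward the distance between the generating policies.

```python
import numpy as np


def pairwise_ranking_loss(ret_worse, ret_better):
    """Generic Bradley-Terry style ranking loss (assumed form, not from the paper).

    Computes -log P(better ranked above worse), which equals
    softplus(ret_worse - ret_better), averaged over all pairs.
    """
    diff = np.asarray(ret_worse, dtype=float) - np.asarray(ret_better, dtype=float)
    # logaddexp(0, x) = log(1 + exp(x)), computed in a numerically stable way
    return float(np.mean(np.logaddexp(0.0, diff)))


def distance_alignment_loss(z_a, z_b, policy_distance):
    """Hypothetical distance-aware term (illustrative stand-in only).

    Penalizes the squared gap between the Euclidean distance of two
    trajectory embeddings and the given distance between their policies.
    """
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    target = np.asarray(policy_distance, dtype=float)
    emb_dist = np.linalg.norm(z_a - z_b, axis=-1)
    return float(np.mean((emb_dist - target) ** 2))
```

Under these assumptions, the ranking term only constrains which trajectory scores higher, while the distance term additionally constrains how far apart the two trajectories sit in the latent space, which is the extra information the abstract argues is needed.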

Sequence Prediction with Unlabeled Data by Reward Function Learning

ELO-Rated Sequence Rewards: Advancing Reinforcement Learning Models

Positive-Unlabeled Reward Learning

Reinforcement Learning from Bagged Reward

ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation

Auxiliary Reward Generation with Transition Distance Representation Learning

Accelerating Exploration with Unlabeled Prior Data

Semi-supervised reward learning for offline reinforcement learning

Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback

Deep Reinforcement Learning For Sequence to Sequence Models

Unsupervised Zero-Shot Reinforcement Learning via Functional Reward Encodings

Semi-Supervised Reward Modeling via Iterative Self-Training

Dense Reward for Free in Reinforcement Learning from Human Feedback

Generalization in Visual Reinforcement Learning with the Reward Sequence Distribution

Video Prediction Models as Rewards for Reinforcement Learning

Reinforcement Learning from Bagged Reward: A Transformer-based Approach for Instance-Level Reward Redistribution

Distance-rank Aware Sequential Reward Learning for Inverse Reinforcement Learning with Sub-optimal Demonstrations

Learning Robust Representation for Reinforcement Learning with Distractions by Reward Sequence Prediction

RL-MD: A Novel Reinforcement Learning Approach for DNA Motif Discovery

Teacher Forcing Recovers Reward Functions for Text Generation

Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language Understanding