Abstract: Preference-based reinforcement learning (PbRL) promises to learn a complex reward function from binary human preferences. However, this human-in-the-loop formulation requires considerable human effort to assign preference labels to segment pairs, which hinders large-scale application. Recent approaches have tried to reuse unlabeled segments, which implicitly elucidates the distribution of segments and thereby reduces the human effort. Consistency regularization has further been considered to improve the performance of semi-supervised learning. However, we notice that, unlike general classification tasks, PbRL exhibits a unique phenomenon that we define in this paper as the similarity trap. Intuitively, humans can have diametrically opposite preferences for similar segment pairs, and such similarity can cause consistency regularization to fail in PbRL. Because of the similarity trap, consistency regularization improperly increases the likelihood that the model's predictions agree across segment pairs, and thus reduces the confidence of reward learning, since the augmented distribution does not match the original one in PbRL. To overcome this issue, we present a self-training method together with our proposed peer regularization, which penalizes the reward model for memorizing uninformative labels and encourages confident predictions. Empirically, we demonstrate that our approach learns a variety of locomotion and robotic manipulation behaviors well, using different semi-supervised alternatives and peer regularization.

Bootstrapped Reward Shaping

STRAPPER: Preference-based Reinforcement Learning via Self-training Augmentation and Peer Regularization

Reward Shaping Based on Optimal-Policy-Free

Benchmarking Potential Based Rewards for Learning Humanoid Locomotion

A new Potential-Based Reward Shaping for Reinforcement Learning Agent

Potential-Based Reward Shaping For Intrinsic Motivation

Potential-Based Intrinsic Motivation: Preserving Optimality With Complex, Non-Markovian Shaping Rewards

On the Sample Efficiency of Abstractions and Potential-Based Reward Shaping in Reinforcement Learning

BAMDP Shaping: a Unified Theoretical Framework for Intrinsic Motivation and Reward Shaping

Highly Efficient Self-Adaptive Reward Shaping for Reinforcement Learning

Shaping Reward Learning Approach from Passive Samples

Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping

The Guiding Role of Reward Based on Phased Goal in Reinforcement Learning

Principled Reward Shaping for Reinforcement Learning Via Lyapunov Stability Theory

Barrier Functions Inspired Reward Shaping for Reinforcement Learning

Hierarchical Potential-based Reward Shaping from Task Specifications

Keeping Your Distance: Solving Sparse Reward Tasks Using Self-Balancing Shaped Rewards

Learning to Shape Rewards Using a Game of Two Partners

Offline Reward Shaping with Scaling Human Preference Feedback for Deep Reinforcement Learning

Shaping Proto-Value Functions via Rewards

ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization