Abstract: In actor-critic reinforcement learning (RL) algorithms, function estimation errors are known to cause ineffective random exploration at the beginning of training and to lead to overestimated values and suboptimal policies. In this paper, we address the problem by performing advantage rectification with imperfect demonstrations, thereby reducing the function estimation errors. Pretraining with expert demonstrations has been widely adopted to accelerate deep reinforcement learning when simulations are expensive to obtain. However, existing methods, such as behavior cloning, often assume that the demonstrations carry additional information or performance labels, for example an optimality assumption, which is usually incorrect and of little use in the real world. We explicitly handle imperfect demonstrations within the actor-critic RL framework and propose a new method called learning from imperfect demonstrations with advantage rectification (LIDAR). LIDAR uses a rectified loss function to learn only from selected demonstrations; the loss is derived from the minimal assumption that the demonstrating policies perform better than the current policy. LIDAR learns from the contradictions caused by estimation errors and in turn reduces those errors. We apply LIDAR to three popular actor-critic algorithms, DDPG, TD3 and SAC. Experiments show that our method observably reduces the function estimation errors, effectively leverages demonstrations that are far from optimal, and consistently outperforms state-of-the-art baselines in all scenarios.
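To make the idea of a rectified loss concrete, the sketch below shows one plausible reading of the abstract; it is not the paper's actual loss, which the abstract does not state. It assumes a hinge-style penalty that is nonzero only on demonstration transitions where the critic's estimate contradicts the assumption that the demonstrator performs better than the current policy, that is, where the estimated advantage of the demonstrated action is negative. The function name `rectified_demo_loss` and its arguments are hypothetical.

```python
def rectified_demo_loss(q_demo, q_policy):
    """Hinge-style penalty over a batch of demonstration transitions.

    ASSUMPTION: this form is illustrative only; the abstract does not give
    the exact LIDAR loss.

    q_demo:   critic estimates Q(s, a_demo) for the demonstrated actions
    q_policy: critic estimates Q(s, pi(s)) for the current policy's actions
    Returns the mean penalty and a per-sample flag marking which
    demonstrations were selected (contributed a nonzero penalty).
    """
    # Estimated advantage of the demonstrated action is q_demo - q_policy.
    # If the demonstrator is assumed better, a negative advantage signals an
    # estimation error, so only those samples are penalized.
    violations = [max(0.0, qp - qd) for qd, qp in zip(q_demo, q_policy)]
    selected = [v > 0.0 for v in violations]
    loss = sum(violations) / len(violations)
    return loss, selected


# Toy batch: only the second demonstration contradicts the assumption.
loss, selected = rectified_demo_loss([1.0, 2.0, 0.5], [0.5, 3.0, 0.5])
print(loss, selected)
```

In this toy batch only the second transition is selected, because it is the only one where the critic rates the policy's action above the demonstrated one. In a real actor-critic setup the penalty would be added to the critic or actor objective and differentiated through the networks, which this plain-Python sketch does not attempt.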

Pretrain Soft Q-Learning with Imperfect Demonstrations

Pretraining Deep Actor-Critic Reinforcement Learning Algorithms With Expert Demonstrations

Bayesian Q-learning With Imperfect Expert Demonstrations

Pre-training Neural Networks with Human Demonstrations for Deep Reinforcement Learning

Pre-training with Non-expert Human Demonstration for Deep Reinforcement Learning

Deep Q-learning From Demonstrations

ZPD Teaching Strategies for Deep Reinforcement Learning from Demonstrations

Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance

Imitation Learning from Purified Demonstrations

Active Deep Q-learning with Demonstration

Demonstration actor critic

Aligning Human Intent from Imperfect Demonstrations with Confidence-based Inverse soft-Q Learning

CEIP: Combining Explicit and Implicit Priors for Reinforcement Learning with Demonstrations

LIDAR: Learning from Imperfect Demonstrations with Advantage Rectification

Efficiently Training On-Policy Actor-Critic Networks in Robotic Deep Reinforcement Learning with Demonstration-like Sampled Exploration

Jointly Pre-training with Supervised, Autoencoder, and Value Losses for Deep Reinforcement Learning

Shaping in Reinforcement Learning by Knowledge Transferred from Human-Demonstrations of a Simple Similar Task

Adaptive Cooperative Exploration for Reinforcement Learning from Imperfect Demonstrations

A reinforcement learning algorithm acquires demonstration from the training agent by dividing the task space

Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations

Learning from Suboptimal Demonstration via Self-Supervised Reward Regression