Abstract: Learning from Demonstration (LfD) aims to facilitate rapid Reinforcement Learning (RL) by leveraging expert demonstrations to pre-train the RL agent. However, the limited availability of expert demonstration data often hinders its ability to effectively aid downstream RL learning. To address this problem, we propose a novel two-stage method dubbed Skill-enhanced Reinforcement Learning Acceleration (SeRLA). SeRLA introduces a skill-level adversarial Positive-Unlabeled (PU) learning model to extract useful skill prior knowledge by enabling learning from both limited expert data and general low-cost demonstration data in the offline prior learning stage. Subsequently, it deploys a skill-based soft actor-critic algorithm to leverage this acquired prior knowledge in the downstream online RL stage for efficient training of a skill policy network. Moreover, we develop a simple skill-level data enhancement technique to further alleviate data sparsity and improve both skill prior learning and downstream skill policy training. Our experimental results on multiple standard RL environments show that the proposed SeRLA method achieves state-of-the-art performance in accelerating reinforcement learning on downstream tasks, especially in the early learning phase.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is how to effectively use limited expert demonstration data together with a large amount of low-cost general demonstration data to accelerate Reinforcement Learning (RL) training on downstream tasks. Specifically, the authors propose a new method named Skill-enhanced Reinforcement Learning Acceleration (SeRLA), which addresses the problem in the following ways:

1. **Skill-level Positive-Unlabeled (PU) Learning**:
   - A skill-level adversarial PU learning model is proposed, which can extract useful skill prior knowledge from limited expert data and a large amount of low-cost demonstration data in the offline prior learning stage.
   - Unlike traditional methods that use only expert data or simply treat general demonstration data as negative samples, SeRLA treats general demonstration data as unlabeled samples, so as to better utilize the potentially useful information in these data.
2. **Skill-based Soft Actor-Critic Algorithm**:
   - In the online downstream RL stage, a skill-based Soft Actor-Critic (SAC) algorithm is deployed, which uses the prior knowledge learned in the offline stage to accelerate the training of the skill policy network.
   - This method applies the learned skills to downstream tasks through behavior cloning, thereby reducing the number of interactions with the environment and improving learning efficiency.
3. **Skill-level Data Enhancement Technique**:
   - A simple Skill-level Data Enhancement (SDE) technique is introduced to further alleviate the data sparsity problem and improve the robustness of skill prior learning and downstream skill policy training.
   - SDE generates enhanced data by adding Gaussian noise to the skill embedding vector, making the model more robust to small perturbations.
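As a rough illustration of the SDE idea described above (adding Gaussian noise to a skill embedding vector), here is a minimal sketch. The function name `augment_skill_embedding`, the noise scale `sigma`, and the number of copies `n_copies` are hypothetical choices for illustration only; the summary does not state the values or the exact procedure the paper uses.

```python
import random


def augment_skill_embedding(z, sigma=0.1, n_copies=4, rng=None):
    """Return n_copies noisy variants of a skill embedding.

    z        : list of floats, one skill embedding vector
    sigma    : standard deviation of the Gaussian noise (assumed value)
    n_copies : how many perturbed embeddings to generate (assumed value)
    rng      : optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    # Each variant adds independent zero-mean Gaussian noise per dimension.
    return [[zi + rng.gauss(0.0, sigma) for zi in z] for _ in range(n_copies)]


if __name__ == "__main__":
    z = [0.5, -1.2, 0.3]
    for z_aug in augment_skill_embedding(z, sigma=0.05, rng=random.Random(0)):
        print(z_aug)
```

The perturbed embeddings could then be fed to whatever consumes the original embedding (for example a skill decoder), so that nearby points in the embedding space are also covered during training.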
### Formula Summary

- **Reconstruction Loss** (for the training of skill encoder and decoder):
  \[ L_{\text{rec}}(\nu, \mu) = \mathbb{E}_{a_t \sim A^{\pi_e}} \left[ \ell_{\text{ls}} \left( \hat{a}_t \sim p_\nu(\cdot \mid z_t \sim q_\mu(\cdot \mid a_t)),\ a_t \right) \right] \]
- **Prior Training Loss** (to ensure that the generated skills are consistent with the given state and action sequences):
  \[ L_{\text{prior}}(\psi, \mu) = \mathbb{E}_{(s_t, a_t) \sim D^{\pi_e}} \left[ L_{\text{KL}}\left(q_\mu(z_t \mid a_t),\ q_\psi(z_t \mid s_t)\right) \right] \]
- **Regularization Loss** (for regularizing the skill embedding space):
  \[ L_{\text{reg}}(\mu) = \mathbb{E}_{a_t \sim A^{\pi_e}} \left[ L_{\text{KL}}\left(q_\mu(z_t \mid a_t),\ p(z_t)\right) \right] \]
- **PU Loss** (for adversarial PU learning):
  \[ L_{\text{pu}}^{D_\zeta}\left(q_\mu(A^{\pi_e}), q_\mu(A^\pi)\right) = \lambda L_1^{D_\zeta}\left(q_\mu(A^{\pi_e})\right) + \max \left( -\xi,\ L_0^{D_\zeta}\left(q_\mu(A^\pi)\right) - \lambda L_0^{D_\zeta}\left(q_\mu(A^{\pi_e})\right) \right) \]
- **Total Loss Function** (for skill prior training of adversarial PU learning): \[ L(
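To make the structure of the PU loss in the formula summary concrete, the following is a minimal sketch that evaluates it on precomputed discriminator outputs. Several things here are assumptions not stated in the summary: the discriminator is taken to output a probability in (0, 1) that a skill embedding comes from expert data, and \(L_1\) and \(L_0\) are taken to be mean negative log-likelihoods for the "positive" and "negative" labels respectively. The names `pu_loss`, `lam`, `xi`, and `eps` are hypothetical.

```python
import math


def pu_loss(d_expert, d_unlabeled, lam=0.5, xi=0.0, eps=1e-8):
    """Evaluate the PU loss on discriminator outputs.

    d_expert    : discriminator outputs in (0, 1) on expert skill embeddings
    d_unlabeled : discriminator outputs in (0, 1) on general-demo embeddings
    lam         : weight lambda from the formula (assumed value)
    xi          : slack xi >= 0 bounding the negative-risk term from below
    eps         : small constant for numerical stability (assumption)
    """
    def l1(scores):
        # Assumed form of L_1: mean loss for labelling samples as positive.
        return sum(-math.log(s + eps) for s in scores) / len(scores)

    def l0(scores):
        # Assumed form of L_0: mean loss for labelling samples as negative.
        return sum(-math.log(1.0 - s + eps) for s in scores) / len(scores)

    # Negative-class risk estimated from unlabeled data, corrected by the
    # expert (positive) data, then clamped from below at -xi.
    negative_risk = l0(d_unlabeled) - lam * l0(d_expert)
    return lam * l1(d_expert) + max(-xi, negative_risk)


if __name__ == "__main__":
    print(pu_loss([0.9, 0.8], [0.2, 0.1], lam=0.5, xi=0.0))
```

The `max(-xi, ...)` term mirrors the formula: when the corrected negative risk would fall below `-xi`, it is clamped, which keeps the estimate from becoming arbitrarily negative when only a few expert samples are available.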

Skill-Enhanced Reinforcement Learning Acceleration from Demonstrations
