Abstract: Hierarchical reinforcement learning (HRL) has achieved remarkable success and significant progress in complex and long-term decision-making problems. However, HRL training typically entails substantial computational costs and an enormous number of samples. One effective approach to tackle this challenge is hierarchical reinforcement learning from demonstrations (HRLfD), which leverages demonstrations to expedite the training process of HRL. The effectiveness of HRLfD is contingent upon the quality of the demonstrations; hence, suboptimal demonstrations may impede efficient learning. To address this issue, this paper proposes a reachability-based reward shaping (RbRS) method to alleviate the negative interference of suboptimal demonstrations for the HRL agent. The novel HRLfD algorithm based on RbRS is named HRLfD-RbRS, which incorporates the RbRS method to enhance the learning efficiency of HRLfD. Moreover, with the help of this method, the learning agent can explore better policies under the guidance of the suboptimal demonstration. We evaluate the proposed HRLfD-RbRS algorithm on various complex robotic tasks, and the experimental results demonstrate that our method outperforms current state-of-the-art HRLfD algorithms.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses two key issues in training Hierarchical Reinforcement Learning (HRL) agents:

1. **Low Training Efficiency**: Traditional HRL algorithms usually require large amounts of computation and sample data to converge, which limits their feasibility in practical applications.
2. **Negative Impact of Suboptimal Demonstrations**: Demonstration data can accelerate HRL training, but low-quality demonstrations may lead to poor learning outcomes or even incorrect decisions.

To tackle these problems, the authors propose a new method, Reachability-based Reward Shaping (RbRS), and apply it within the framework of HRL from Demonstrations (HRLfD). With this method, the learning agent can explore better policies and learn more efficiently even when the demonstrations are suboptimal.

### Specific Methods

1. **Reachability Constraint**: An m-step reachable region is defined so that generated sub-goals lie within m steps of the demonstration trajectory. This lets the agent make full use of the demonstration data while still exploring behavior superior to the demonstration.
2. **Incorporating Neighbor Constraints**: A k-step neighboring region is introduced to further improve policy quality, so that the agent does not rely solely on the demonstration data and become stuck in suboptimal solutions.
3. **Multi-level Constraints**: Constraints are applied to both high-level and low-level policies, making the entire learning process more efficient and robust.

With these methods, the HRLfD-RbRS algorithm proposed in the paper performs better on various complex robotic tasks, showing higher sample efficiency and learning effectiveness than existing state-of-the-art algorithms.
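
The reachability constraint can be illustrated with a small Python sketch. This is not the authors' implementation: the summary does not give the exact form of the shaping term or the way reachability is measured. The sketch assumes a Euclidean-distance proxy for "steps" and a fixed penalty. The names `steps_between`, `min_steps_to_demo`, `shaped_reward`, `step_size`, and `penalty` are hypothetical and introduced only for illustration.

```python
import math
from typing import Sequence

State = Sequence[float]


def steps_between(a: State, b: State, step_size: float) -> int:
    """Assumed reachability proxy: number of fixed-length steps needed to
    cover the Euclidean distance between two states."""
    return math.ceil(math.dist(a, b) / step_size)


def min_steps_to_demo(subgoal: State, demo: Sequence[State], step_size: float) -> int:
    """Fewest steps from the sub-goal to any state on the demonstration."""
    return min(steps_between(subgoal, s, step_size) for s in demo)


def shaped_reward(
    env_reward: float,
    subgoal: State,
    demo: Sequence[State],
    m: int,
    step_size: float = 1.0,
    penalty: float = 1.0,
) -> float:
    """Leave the reward unchanged when the sub-goal lies inside the m-step
    reachable region of the demonstration; otherwise subtract a penalty.
    The fixed-penalty form is an assumption, not taken from the paper."""
    if min_steps_to_demo(subgoal, demo, step_size) <= m:
        return env_reward
    return env_reward - penalty


# Toy demonstration: a straight line along the x-axis.
demo = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

near = shaped_reward(0.0, (2.0, 1.5), demo, m=2)  # 2 steps away: inside the region
far = shaped_reward(0.0, (2.0, 5.0), demo, m=2)   # 5 steps away: outside the region
print(near, far)
```

In this toy setting, a sub-goal that stays close to the demonstration keeps its original reward, while a distant one is penalized. The high-level policy is therefore nudged toward the demonstration without being forced to copy it exactly.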

Hierarchical Reinforcement Learning from Demonstration via Reachability-Based Reward Shaping
