Hierarchical Reinforcement Learning from Demonstration via Reachability-Based Reward Shaping

Xiaozhu Gao, Jinhui Liu, Bo Wan, Lingling An
DOI: https://doi.org/10.1007/s11063-024-11632-x
IF: 2.565
2024-05-29
Neural Processing Letters
Abstract: Hierarchical reinforcement learning (HRL) has achieved remarkable success and significant progress in complex and long-term decision-making problems. However, HRL training typically entails substantial computational costs and an enormous number of samples. One effective approach to tackle this challenge is hierarchical reinforcement learning from demonstrations (HRLfD), which leverages demonstrations to expedite the training process of HRL. The effectiveness of HRLfD is contingent upon the quality of the demonstrations; hence, suboptimal demonstrations may impede efficient learning. To address this issue, this paper proposes a reachability-based reward shaping (RbRS) method to alleviate the negative interference of suboptimal demonstrations for the HRL agent. The novel HRLfD algorithm based on RbRS is named HRLfD-RbRS, which incorporates the RbRS method to enhance the learning efficiency of HRLfD. Moreover, with the help of this method, the learning agent can explore better policies under the guidance of the suboptimal demonstration. We evaluate the proposed HRLfD-RbRS algorithm on various complex robotic tasks, and the experimental results demonstrate that our method outperforms current state-of-the-art HRLfD algorithms.
computer science, artificial intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses two key issues in training Hierarchical Reinforcement Learning (HRL) agents:

1. **Low Training Efficiency**: Traditional HRL algorithms usually need large amounts of computation and sample data to converge, which limits their feasibility in practical applications.
2. **Negative Impact of Suboptimal Demonstrations**: Demonstration data can accelerate HRL training, but low-quality demonstrations may lead to poor learning outcomes or even incorrect decisions.

To tackle these problems, the authors propose a new method, Reachability-based Reward Shaping (RbRS), and apply it within the framework of HRL from Demonstrations (HRLfD). With this method, the learning agent can explore better policies and learn more efficiently even when the demonstrations are suboptimal.

### Specific Methods

1. **Reachability Constraint**: An m-step reachable region is defined so that generated sub-goals lie within m steps of the demonstration trajectory. The agent can then make full use of the demonstration data while still exploring behavior that improves on it.
2. **Incorporating Neighbor Constraints**: A k-step neighboring region is introduced to further improve policy quality, so that the agent does not rely solely on the demonstration data and become stuck in suboptimal solutions.
3. **Multi-level Constraints**: Constraints are applied to both the high-level and the low-level policies, making the overall learning process more efficient and robust.

With these methods, the HRLfD-RbRS algorithm proposed in the paper shows higher sample efficiency and better learning performance than existing state-of-the-art algorithms on various complex robotic tasks.
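
The reachability constraint can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not give the shaping formula, so the functions `steps_to_demo` and `shaped_reward`, the parameters `step_size` and `penalty`, and the use of Euclidean distance as a stand-in for "number of steps" are all assumptions made for illustration. The idea shown is only that a sub-goal within m steps of the demonstration trajectory keeps its reward, while a sub-goal outside that region is penalized.

```python
import math


def steps_to_demo(subgoal, demo_states, step_size=1.0):
    """Approximate how many steps separate `subgoal` from the demonstration.

    Assumption (not from the paper): one step covers at most `step_size`
    in Euclidean distance, so steps = ceil(nearest distance / step_size).
    """
    nearest = min(math.dist(subgoal, state) for state in demo_states)
    return math.ceil(nearest / step_size)


def shaped_reward(env_reward, subgoal, demo_states, m, step_size=1.0, penalty=1.0):
    """Return the reward after a hypothetical reachability-based penalty.

    Sub-goals inside the m-step region around the demonstration are left
    unchanged; each step beyond m costs `penalty`.
    """
    excess = max(0, steps_to_demo(subgoal, demo_states, step_size) - m)
    return env_reward - penalty * excess


# Hypothetical 2-D demonstration trajectory along the x-axis.
demo = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

inside = shaped_reward(1.0, (1.0, 2.0), demo, m=3)   # 2 steps away, within m
outside = shaped_reward(1.0, (1.0, 5.0), demo, m=3)  # 5 steps away, 2 beyond m
```

Under these assumptions, `inside` keeps the original reward and `outside` is reduced in proportion to how far the sub-goal falls outside the m-step region. A real system would more likely estimate step distance from the environment dynamics or a learned model than from raw Euclidean distance.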