Relabeling and policy distillation of hierarchical reinforcement learning

DOI: https://doi.org/10.1007/s13042-024-02192-6
2024-05-12
International Journal of Machine Learning and Cybernetics
Abstract:Hierarchical reinforcement learning (HRL) is a promising method to extend traditional reinforcement learning to solve more complex tasks. HRL can solve the problems of long-term reward sparsity and credit assignment. However, the existing HRL methods are trained in specific environments and target tasks each time, resulting in low sample utilization. In addition, the low-level sub-policies of the agent will interfere with each other during the migration process, resulting in poor policy stability. Aiming at the issue above, this paper proposes an HRL method, Relabeling and Policy Distillation of Hierarchical Reinforcement Learning (R-PD-HRL), that integrates meta-learning, shared reward relabeling and policy distillation to accelerate the learning speed and improve the policy stability of the agent. In the training process, a reward relabeling module is introduced to act on the experience buffer. Different reward functions are used to relabel the interaction trajectory for the training of other tasks under the same task distribution. At the low-level, policy distillation technology is used to compress the sub-policies of the low-level, and the interference between the policies is reduced while ensuring the correctness of the original low-level sub-policies. Finally, according to different tasks, the high-level policy calls the low-level optimal policy to complete the decision. In both continuous and discrete state-action environments, experimental results show that compared with other methods, the improved sample utilization of this method greatly accelerates the learning speed, and the success rate is as high as 0.6.
computer science, artificial intelligence
What problem does this paper attempt to address?