Hierarchical Reinforcement Learning with Advantage-Based Auxiliary Rewards

Siyuan Li, Rui Wang, Minxue Tang, Chongjie Zhang
DOI: https://doi.org/10.48550/arXiv.1910.04450
2019-10-10
Abstract: Hierarchical Reinforcement Learning (HRL) is a promising approach to solving long-horizon problems with sparse and delayed rewards. Many existing HRL algorithms either use pre-trained low-level skills that are unadaptable, or require domain-specific information to define low-level rewards. In this paper, we aim to adapt low-level skills to downstream tasks while maintaining the generality of reward design. We propose an HRL framework which sets auxiliary rewards for low-level skill training based on the advantage function of the high-level policy. This auxiliary reward enables efficient, simultaneous learning of the high-level policy and low-level skills without using task-specific knowledge. In addition, we also theoretically prove that optimizing low-level skills with this auxiliary reward will increase the task return for the joint policy. Experimental results show that our algorithm dramatically outperforms other state-of-the-art HRL methods in MuJoCo domains. We also find both low-level and high-level policies trained by our algorithm transferable.
Machine Learning, Artificial Intelligence
### What problem does this paper attempt to address?
The paper addresses a major challenge for Reinforcement Learning (RL) in long-horizon tasks: how to learn policies effectively when rewards are sparse and delayed. It focuses on Hierarchical Reinforcement Learning (HRL), which tackles this challenge with control structures that operate at multiple time scales. Existing HRL algorithms, however, have two main problems:

1. **Pre-trained low-level skills adapt poorly**: Many HRL algorithms rely on pre-trained low-level skills, which are hard to adapt to downstream tasks because the reward function used in pre-training may be inconsistent with the actual task.
2. **Low-level rewards require domain-specific knowledge**: Some HRL algorithms need domain-specific information to design low-level rewards, which limits their generality.

To overcome these problems, the paper proposes a new HRL framework, Hierarchical Reinforcement Learning with Advantage-based Auxiliary Rewards (HAAR). Its main contributions are:

- **Auxiliary rewards based on the advantage function of the high-level policy**: The advantage function of the high-level policy defines the auxiliary reward for low-level skills, so the skills can be learned without task-specific knowledge.
- **Simultaneous learning of high-level and low-level policies**: HAAR trains the high-level policy and the low-level skills at the same time, so the low-level skills are adapted to the task during training, which improves learning efficiency.
- **Theoretical proof**: The paper proves that optimizing low-level skills with this auxiliary reward increases the task return of the joint policy, and that the method inherits the monotonic improvement property.
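The auxiliary reward described above can be illustrated with a minimal sketch. It assumes the per-step form given in the formula section of this summary: a one-step estimate of the high-level advantage, split evenly over the `k` low-level steps of one skill execution. The function name, argument names, and numeric values are hypothetical and are not taken from the authors' code.

```python
from typing import List


def low_level_auxiliary_rewards(
    r_h: float,      # high-level reward collected while the skill ran
    v_start: float,  # V_h(s_h^t): high-level value where the skill started
    v_end: float,    # V_h(s_h^{t+k}): high-level value where the skill ended
    gamma_h: float,  # high-level discount factor
    k: int,          # skill length: low-level steps per high-level step
) -> List[float]:
    """Return the auxiliary reward for each of the k low-level steps.

    Illustrative only: uses the one-step advantage estimate
    r_h + gamma_h * V_h(s') - V_h(s), divided evenly by k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    advantage_estimate = r_h + gamma_h * v_end - v_start
    return [advantage_estimate / k] * k


# Hypothetical numbers, chosen only to show the call.
rewards = low_level_auxiliary_rewards(
    r_h=1.0, v_start=0.5, v_end=2.0, gamma_h=0.9, k=5
)
print(rewards)
```

Every low-level step in the same skill execution receives the same reward here, so a skill is reinforced exactly when the high-level action that selected it turned out better than the high-level value function expected.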
### Formula Explanation

- **Advantage Function**: the advantage of the high-level action \(a_h\) in the high-level state \(s_h\), where \(k\) is the number of low-level steps per high-level step, \(\gamma_h\) is the high-level discount factor, and \(V_h\) is the high-level value function:

  \[ A_h(s_h, a_h)=\mathbb{E}_{s_h^{t+k}\sim(\pi_h,\pi_l)}\left[r_h^t+\gamma_h V_h(s_h^{t+k})\mid a_h^t=a_h,\ s_h^t=s_h\right]-V_h(s_h) \]

- **Low-level Auxiliary Reward**: the total auxiliary reward over the \(k\) low-level steps of one skill execution equals the high-level advantage:

  \[ R_l^{s_h^t,a_h^t}(s_l^t,\ldots,s_l^{t+k-1})=A_h(s_h^t,a_h^t) \]

  Estimating the advantage with a one-step return and splitting it evenly over the \(k\) low-level steps gives the auxiliary reward for each low-level step:

  \[ r_l^t=\frac{r_h^t+\gamma_h V_h(s_h^{t+k})-V_h(s_h^t)}{k} \]

- **Skill Length Annealing**: the skill length \(k_i\) at iteration \(i\) decays exponentially from the initial length \(k_1\) at rate \(\tau\), and is bounded below by the shortest skill length \(k_s\):

  \[ k_i=\max\left(k_1e^{-\tau i},\,k_s\right) \]

### Experimental Results

The paper runs experiments on MuJoCo benchmark tasks, and the results show that HAAR significantly outperforms other state-of-the-art HRL methods. Specifically:

- **Faster learning**: HAAR learns faster and converges to a higher value in all tasks.
- **The effect of skill length annealing**: Skill length annealing accelerates learning but has little impact on the final training results.
- **Transferability of policies**: The high-level and low-level policies trained by HAAR can be transferred to similar new tasks, further supporting the effectiveness of the method.

In conclusion, by introducing advantage-based auxiliary rewards, the paper addresses the adaptability and generality problems of low-level skills in HRL and provides an effective approach to long-horizon tasks with sparse and delayed rewards.
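As a small illustration of the skill length annealing formula \(k_i=\max(k_1e^{-\tau i},k_s)\) given in the formula section, the schedule can be sketched as follows. The function name and the values of `k_1`, `tau`, and `k_s` are hypothetical, chosen only to show the shape of the schedule; they are not the settings used in the paper.

```python
import math


def annealed_skill_length(i: int, k_1: float, tau: float, k_s: float) -> float:
    """Skill length at iteration i: exponential decay from k_1, floored at k_s."""
    return max(k_1 * math.exp(-tau * i), float(k_s))


# Hypothetical settings: start at 100 low-level steps per skill, never go below 10.
schedule = [
    annealed_skill_length(i, k_1=100, tau=0.01, k_s=10) for i in (0, 100, 1000)
]
print(schedule)
```

The formula yields a real number, so an implementation would need to convert it to a whole number of low-level steps; whether to round or truncate is an assumption left open in this sketch.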