Abstract: Hierarchical Reinforcement Learning (HRL) is a promising approach to solving long-horizon problems with sparse and delayed rewards. Many existing HRL algorithms either use pre-trained low-level skills that are unadaptable, or require domain-specific information to define low-level rewards. In this paper, we aim to adapt low-level skills to downstream tasks while maintaining the generality of reward design. We propose an HRL framework which sets auxiliary rewards for low-level skill training based on the advantage function of the high-level policy. This auxiliary reward enables efficient, simultaneous learning of the high-level policy and low-level skills without using task-specific knowledge. In addition, we also theoretically prove that optimizing low-level skills with this auxiliary reward will increase the task return for the joint policy. Experimental results show that our algorithm dramatically outperforms other state-of-the-art HRL methods in Mujoco domains. We also find both low-level and high-level policies trained by our algorithm transferable.
What problem does this paper attempt to address?
This paper addresses a major challenge for Reinforcement Learning (RL) in long-horizon tasks: how to learn policies effectively when rewards are sparse and delayed. Specifically, it focuses on Hierarchical Reinforcement Learning (HRL) methods, which tackle this challenge by building control structures that operate on multiple time scales. However, existing HRL algorithms have two main problems:
1. **Insufficient adaptability of pre-trained low-level skills**: Many HRL algorithms use pre-trained low-level skills, which are difficult to adapt to downstream tasks because the reward function used for pre-training may be inconsistent with the actual task.
2. **Requiring domain-specific knowledge to define low-level rewards**: Some HRL algorithms require domain-specific information to design low-level rewards, which limits their generality.
To overcome these problems, the paper proposes a new HRL framework, Hierarchical Reinforcement Learning with Advantage-based Auxiliary Rewards (HAAR). The main contributions of this framework include:
- **Setting auxiliary rewards based on the advantage function of the high-level policy**: Using the advantage function of the high-level policy as the auxiliary reward for low-level skills lets those skills be learned effectively without relying on task-specific knowledge.
- **Simultaneously learning high-level and low-level policies**: The HAAR framework allows the high-level policy and the low-level skills to be learned simultaneously, without pre-training the low-level skills, thereby improving learning efficiency.
- **Theoretical proof**: The paper also proves that optimizing low-level skills with this auxiliary reward increases the task return of the joint policy, and that the method inherits the monotonic improvement property.
### Formula Explanation
- **Advantage Function**:
\[
A_h(s_h, a_h)=\mathbb{E}_{s_h^{t + k}\sim(\pi_h,\pi_l)}\left[r_h^t+\gamma_h V_h(s_h^{t + k})\mid a_h^t = a_h, s_h^t = s_h\right]-V_h(s_h)
\]
- **Low-level Auxiliary Reward**:
\[
R_{l}^{s_h^t,a_h^t}(s_l^t,\ldots,s_l^{t + k-1})=A_h(s_h^t,a_h^t)
\]
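Splitting this total evenly over the \(k\) low-level steps of the skill, and replacing the expectation in the advantage with a one-sample estimate, gives the auxiliary reward for each low-level step: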
\[
r_l^t=\frac{r_h^t+\gamma_h V_h(s_h^{t + k})-V_h(s_h^t)}{k}
\]
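To make the per-step formula concrete, here is a minimal Python sketch. The function name `low_level_auxiliary_rewards`, its parameter names, and the numeric values in the usage example are illustrative assumptions and do not come from the paper; the value estimates \(V_h\) are assumed to be supplied by some high-level critic.

```python
from typing import List


def low_level_auxiliary_rewards(
    r_h: float, v_start: float, v_end: float, k: int, gamma_h: float
) -> List[float]:
    """Return the k per-step auxiliary rewards for one high-level step.

    r_h     : high-level reward collected while the skill ran
    v_start : V_h(s_h^t), value estimate where the skill started
    v_end   : V_h(s_h^{t+k}), value estimate where the skill ended
    k       : number of low-level steps the skill lasted
    gamma_h : high-level discount factor
    """
    if k <= 0:
        raise ValueError("k must be a positive number of low-level steps")
    # One-sample estimate of the high-level advantage A_h(s_h^t, a_h^t).
    advantage = r_h + gamma_h * v_end - v_start
    # Every low-level step in the skill receives an equal share.
    return [advantage / k] * k


# Hypothetical numbers, chosen only to illustrate the arithmetic.
rewards = low_level_auxiliary_rewards(
    r_h=1.0, v_start=2.0, v_end=3.0, k=5, gamma_h=0.9
)
```

With these made-up inputs the advantage estimate is 1.0 + 0.9 × 3.0 − 2.0 = 1.7, so each of the five low-level steps receives roughly 0.34, and the per-step rewards sum back to the advantage.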
- **Skill Length Annealing**:
\[
k_i=\max(k_1e^{-\tau i},k_s)
\]
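The schedule can be sketched in a few lines of Python. This reads \(k_1\) as the initial skill length, \(\tau\) as the decay rate, \(i\) as the annealing iteration, and \(k_s\) as the shortest allowed skill length; that reading, the rounding to an integer step count, and the numeric values below are assumptions made for illustration rather than details taken from the paper.

```python
import math


def annealed_skill_length(i: int, k_1: float, tau: float, k_s: int) -> int:
    """Skill length at annealing iteration i: max(k_1 * exp(-tau * i), k_s)."""
    decayed = k_1 * math.exp(-tau * i)
    # Rounding is an assumption: a skill must last a whole number of steps.
    return int(round(max(decayed, k_s)))


# Hypothetical settings: start at 100 steps, never go below 10.
lengths = [annealed_skill_length(i, k_1=100, tau=0.1, k_s=10) for i in (0, 10, 50)]
```

Under these assumed settings the skill length starts at 100, decays to about 37 after ten iterations, and is clamped at the floor of 10 once the exponential term falls below it.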
### Experimental Results
The paper runs experiments on MuJoCo benchmark tasks, and the results show that HAAR significantly outperforms other state-of-the-art HRL methods. Specifically:
- **Faster learning**: HAAR learns faster and converges to higher returns in all tasks.
- **The effect of skill length annealing**: Skill length annealing accelerates learning but has little impact on final performance.
- **Transferability of policies**: Experiments show that the high-level and low-level policies trained by HAAR can be transferred to similar new tasks, further supporting the effectiveness of the method.
In conclusion, by introducing advantage-based auxiliary rewards, this paper addresses the adaptability and generality problems of low-level skills in HRL, and provides an effective approach to long-horizon tasks with sparse and delayed rewards.