Abstract: We describe a novel extension of soft actor-critics for hierarchical Deep Q-Networks (HDQN) architectures using mutual information metric. The proposed extension provides a suitable framework for encouraging explorations in such hierarchical networks. A natural utilization of this framework is an adversarial setting, where meta-controller and controller play minimax over the mutual information objective but cooperate on maximizing expected rewards.
What problem does this paper attempt to address?
This paper addresses how to encourage effective exploration in Hierarchical Reinforcement Learning (HRL) when reward feedback is sparse. Specifically, the authors propose a Hierarchical Soft Actor-Critic (HSAC) method that uses mutual information optimization to promote exploration in the hierarchical network.
### Main Problems
1. **Exploration in Sparse-Reward Environments**:
- In environments with sparse reward feedback, exploration is one of the key challenges in designing data-efficient reinforcement learning algorithms. Traditional reinforcement learning frameworks often struggle to find effective strategies in these environments.
2. **Exploration Coordination in Hierarchical Reinforcement Learning**:
- Hierarchical reinforcement learning improves exploration efficiency by decomposing the problem into different levels of abstraction. The open issue is how to keep the exploration of the high-level controller from interfering with meaningful exploration by the low-level controller, and vice versa.
### Solutions
To address the above problems, the paper proposes the following solutions:
1. **Maximum Entropy Reinforcement Learning (ME-RL)**:
- Add a maximum entropy term to each controller's objective. The term rewards more random policies, which encourages broader exploration.
2. **Mutual Information Reinforcement Learning (MI-RL)**:
- Use the mutual information metric to decouple exploration across the levels of the hierarchy. Specifically, the objective function of the controller is modified to minimize the mutual information \(I(a; g|s)\) between actions and sub-goals, which helps keep the exploration of the high-level and low-level controllers independent of each other.
3. **Adversarial Exploration Mechanism**:
- Introduce an adversarial setting in which the meta-controller and the controller play a minimax game over the mutual information objective but cooperate in maximizing the expected reward. This setting can further promote meaningful exploration.
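The mutual information term used in the second and third solutions can be made concrete with a small calculation. The following is a minimal sketch, assuming tabular (discrete) policies evaluated at a single fixed state; the function names and the two example controllers are hypothetical illustrations, not code from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def marginal_action_policy(pi_g, pi_ag):
    """pi_a(a|s) = sum_g pi_g(g|s) * pi_ag(a|s,g) at one fixed state s."""
    n_actions = len(pi_ag[0])
    return [sum(pg * row[a] for pg, row in zip(pi_g, pi_ag))
            for a in range(n_actions)]

def conditional_mutual_information(pi_g, pi_ag):
    """I(a; g | s) = H(pi_a(.|s)) - E_g[H(pi_ag(.|s,g))] at one fixed state s."""
    h_marginal = entropy(marginal_action_policy(pi_g, pi_ag))
    h_conditional = sum(pg * entropy(row) for pg, row in zip(pi_g, pi_ag))
    return h_marginal - h_conditional

# Hypothetical example: two sub-goals, two actions, one fixed state.
pi_g = [0.5, 0.5]                        # meta-controller: uniform over sub-goals
goal_blind = [[0.5, 0.5], [0.5, 0.5]]    # controller ignores the sub-goal
goal_driven = [[1.0, 0.0], [0.0, 1.0]]   # controller's action is fixed by the sub-goal

mi_blind = conditional_mutual_information(pi_g, goal_blind)
mi_driven = conditional_mutual_information(pi_g, goal_driven)
```

In this toy setting, the goal-blind controller yields zero mutual information, while the goal-driven controller yields the maximum possible for two equiprobable sub-goals (ln 2 nats). A penalty on \(I(a; g|s)\) therefore pushes the controller's action distribution to depend less on the chosen sub-goal.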
### Formula Representation
- **ME-RL Objective Function**:
\[
J(\pi_g)=\sum_{t = 0}^{T}\mathbb{E}_{(s_t,g_t)\sim\rho^{\pi_g}}\left[r(s_t,g_t)+\alpha H(\pi_g(\cdot|s_t))\right]
\]
\[
J(\pi_{ag})=\sum_{t = 0}^{T}\mathbb{E}_{(s_t,a_t)\sim\rho^{\pi_{ag}}}\left[r(s_t,a_t|g_t)+\alpha H(\pi_{ag}(\cdot|s_t,g_t))\right]
\]
- **MI-RL Controller Objective Function**:
\[
J(\pi_{ag})=\sum_{t = 0}^{T}\mathbb{E}_{(s_t,a_t)\sim\rho^{\pi_{ag}}}\left[r(g_t,s_t,a_t)-\alpha I(a_t;g_t|s_t)\right]
\]
- **Adversarial MI-HRL**:
\[
\min_{\pi_{ag}}\max_{\pi_g}H(\pi_a(\cdot|s)) - H(\pi_{ag}(\cdot|s,g))
\]
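The difference of entropies in the adversarial objective has the form of a conditional mutual information. The sketch below spells this out under two assumptions that are not stated explicitly above: that \(\pi_a\) denotes the action policy marginalized over sub-goals, and that the second entropy is averaged over \(g\sim\pi_g(\cdot|s)\).
\[
\pi_a(a|s)=\sum_{g}\pi_g(g|s)\,\pi_{ag}(a|s,g)
\]
\[
I(a;g|s)=H(\pi_a(\cdot|s))-\mathbb{E}_{g\sim\pi_g(\cdot|s)}\left[H(\pi_{ag}(\cdot|s,g))\right]
\]
Under these assumptions, the minimax game is played over \(I(a;g|s)\) itself: the controller \(\pi_{ag}\) minimizes it, matching the penalty in the MI-RL controller objective, while the meta-controller \(\pi_g\) maximizes it.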
Through these methods, the paper offers a new direction for encouraging exploration in hierarchical reinforcement learning and demonstrates its effectiveness in discrete-state MDP experiments.