Abstract: We address the challenge of effective exploration while maintaining good performance in policy gradient methods. As a solution, we propose diverse exploration (DE) via conjugate policies. DE learns and deploys a set of conjugate policies which can be conveniently generated as a byproduct of conjugate gradient descent. We provide both theoretical and empirical results showing the effectiveness of DE at achieving exploration, improving policy performance, and the advantage of DE over exploration by random policy perturbations.
What problem does this paper attempt to address?
This paper addresses the tension between effective exploration and maintaining good performance in policy gradient (PG) methods. Specifically, it aims to improve the exploration ability of PG methods through Diverse Exploration (DE) while ensuring that the performance of the deployed policies does not degrade significantly.
### Problem Background
In Reinforcement Learning (RL), policy gradient methods such as TRPO (Trust Region Policy Optimization) can train large neural networks as policy function approximators, but they often converge slowly and use data inefficiently, mainly because they lack an effective exploration mechanism. Traditional exploration strategies (such as ε-greedy or R-MAX) achieve some degree of exploration, but cannot guarantee the performance of the behavior policy.
### Paper Solution
To solve this problem, this paper proposes a method of diverse exploration through Conjugate Policies. Specifically:
1. **Diverse Exploration (DE)**: DE learns and deploys a set of Conjugate Policies. These policies lie within a local region of the policy space and are conveniently generated as a byproduct of Conjugate Gradient Descent (CGD).
2. **Theoretical and Empirical Results**: The paper provides theoretical and empirical results showing that DE achieves effective exploration, improves policy performance, and outperforms exploration by random policy perturbations.
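
To make the "byproduct" idea concrete, here is a minimal sketch of the conjugate gradient method that keeps its search directions instead of discarding them. It is an illustration under stated assumptions, not the paper's implementation: the matrix `F` stands in for a Fisher information matrix, `g` for a policy gradient, and all function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10):
    """Solve A x = b for a symmetric positive definite matrix A.

    Returns the solution together with the search directions used along
    the way. In exact arithmetic these directions are mutually
    A-conjugate: p_i^T A p_j = 0 for i != j.
    """
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, with x = 0 initially
    p = list(r)              # first search direction
    directions = []
    rs_old = dot(r, r)
    for _ in range(len(b)):
        if rs_old < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        directions.append(p)
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, directions


# Toy stand-ins (assumed values): F plays the Fisher matrix, g the gradient.
F = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
g = [1.0, 2.0, 3.0]

x, directions = conjugate_gradient(F, g)   # x approximates F^{-1} g
```

In this sketch `x` approximates the natural gradient direction, and `directions` holds the conjugate vectors that a DE-style method could reuse as perturbation directions.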
### Key Contributions
1. **Proposing the DE Solution**: Achieving diverse exploration in the Natural Policy Gradient (NPG) method through Conjugate Policies. DE learns and deploys a set of Conjugate Policies in the local area of the policy space and follows the natural gradient descent direction in each policy improvement iteration.
2. **Theoretical Explanation**: Explaining why diverse exploration through Conjugate Policies is effective in the NPG method. Theoretical results show that:
- Maximizing the KL-divergence (Kullback-Leibler divergence) between perturbed policies can reduce the variance of perturbed gradient estimates, thereby improving the accuracy of policy updates.
- The Conjugate Policies generated by conjugate vectors maximize the pairwise KL-divergence among a limited number of perturbations.
3. **Algorithm Framework**: Developing a general algorithm framework for achieving diverse exploration in the NPG method through Conjugate Policies. This algorithm efficiently generates Conjugate Policies using the conjugate vectors generated when calculating the natural gradient descent direction in each policy improvement iteration.
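
The link between conjugacy and pairwise KL-divergence can be illustrated with the standard second-order approximation KL ≈ ½ dᵀFd for a small parameter shift d. The sketch below is an intuition aid, not the paper's proof or algorithm. It assumes a toy 2×2 matrix `F`, hand-picked directions, and a hypothetical rescaling that places every perturbed policy at the same approximate KL distance `delta` from the main policy.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def quad_kl(F, d):
    """Second-order KL approximation 0.5 * d^T F d for a parameter shift d."""
    return 0.5 * dot(d, matvec(F, d))


def scale_to_kl(F, p, delta):
    """Rescale p so its approximate KL from the main policy equals delta."""
    scale = math.sqrt(delta / quad_kl(F, p))
    return [scale * pi for pi in p]


def pairwise_kl(F, d_a, d_b):
    """Approximate KL between two policies perturbed by d_a and d_b."""
    return quad_kl(F, [a - b for a, b in zip(d_a, d_b)])


F = [[4.0, 1.0], [1.0, 3.0]]   # assumed toy Fisher matrix
delta = 0.01                   # assumed KL budget per perturbation

p0 = [1.0, 0.0]
p1 = [1.0, -4.0]               # F-conjugate to p0: p0^T F p1 = 0
q1 = [1.0, 1.0]                # not conjugate: p0^T F q1 = 5 > 0

d0 = scale_to_kl(F, p0, delta)
kl_conjugate = pairwise_kl(F, d0, scale_to_kl(F, p1, delta))
kl_correlated = pairwise_kl(F, d0, scale_to_kl(F, q1, delta))
```

Expanding the quadratic form gives a pairwise value of 2·delta minus the cross term d_aᵀFd_b. For the conjugate pair the cross term vanishes, so `kl_conjugate` equals 2·delta, while the positively correlated pair ends up closer together.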
### Experimental Results
In experiments on three continuous control tasks (Hopper, Walker, and HalfCheetah), TRPO with DE significantly outperforms both the baseline TRPO and TRPO with random perturbations (RP). Specifically:
- **Performance Improvement**: DE not only improves the performance of the main policy but also finds better policies through more extensive exploration.
- **Variance Reduction**: DE shows smaller variance in policy performance, indicating that it can more consistently escape from local optima.
- **Exploration Diversity**: The KL-divergence between the perturbed policies generated by DE is larger, verifying its advantage in exploration diversity.
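
A diversity measure of this kind can be computed in closed form when policies output Gaussian action distributions. The sketch below assumes diagonal Gaussian policies, a common choice for continuous control that the summary does not confirm. The function name and inputs are hypothetical.

```python
import math


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) between two diagonal Gaussians.

    Each argument is a list with one entry per action dimension:
    means (mu_*) and standard deviations (sigma_*).
    """
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += (math.log(sq / sp)
                  + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2)
                  - 0.5)
    return total


# Two hypothetical policies evaluated at the same state.
kl_same = gaussian_kl([0.0], [1.0], [0.0], [1.0])      # identical policies
kl_shifted = gaussian_kl([0.0], [1.0], [1.0], [1.0])   # mean shifted by 1
```

Averaging such values over sampled states and over all pairs of perturbed policies would give one possible diversity score. Identical policies give 0, and a unit mean shift at unit variance gives 0.5.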
In conclusion, this paper achieves diverse exploration in policy gradient methods by introducing Conjugate Policies, addresses the tension between effective exploration and performance maintenance, and thereby improves the overall performance of the reinforcement learning algorithms it is applied to.