Abstract: The ability to discover approximately optimal policies in domains with sparse rewards is crucial to applying reinforcement learning (RL) in many real-world scenarios. Approaches such as neural density models and continuous exploration (e.g., Go-Explore) have been proposed to maintain the high exploration rate necessary to find high-performing and generalizable policies. Soft actor-critic (SAC) is another method for improving exploration that aims to combine efficient learning via off-policy updates with maximization of the policy entropy. In this work, we extend SAC to a richer class of probability distributions (e.g., multimodal) through normalizing flows (NF) and show that this significantly improves performance by accelerating the discovery of good policies while using much smaller policy representations. Our approach, which we call SAC-NF, is a simple, efficient, easy-to-implement modification of SAC that improves on it across continuous control baselines such as the MuJoCo and PyBullet Roboschool domains. Finally, SAC-NF does this while being significantly more parameter-efficient, using as few as 5.5% of the parameters of an equivalent SAC model.
What problem does this paper attempt to address?
This paper asks how to discover near-optimal policies in domains with sparse rewards, which is key to applying reinforcement learning (RL) in many real-world scenarios. Specifically, the authors extend the Soft Actor-Critic (SAC) algorithm with normalizing flows (NF) to improve exploration efficiency and performance.
### Specific Problem Description
1. **Insufficient Exploration**:
- In high-dimensional continuous-control tasks, existing RL algorithms often fail to explore the environment fully and tend to get trapped in local optima. Robotic environments, for example, contain many local minima, so the learned policies are often not robust.
2. **Limitations of SAC**:
- SAC promotes exploration by maximizing policy entropy, but it is limited to simple distributions with closed-form entropy (such as unimodal Gaussians). This restricts its exploration ability, especially in tasks that call for complex, multimodal action distributions.
3. **Parameter Efficiency**:
- The proposed method should not only outperform existing methods but also be markedly more parameter-efficient, matching or exceeding their results with fewer parameters (the abstract reports as few as 5.5% of the parameters of an equivalent SAC model).
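To make the second point concrete, the following is the maximum-entropy objective as it is usually written in the SAC literature, together with the closed-form entropy of a diagonal Gaussian policy. The notation (temperature $\alpha$, state-action marginal $\rho_\pi$, action dimension $d$) is an assumption for illustration and does not appear in this summary.

$$
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
$$

$$
\mathcal{H}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))\big) = \frac{d}{2}\left(1 + \log 2\pi\right) + \sum_{i=1}^{d} \log \sigma_i
$$

The second expression is available only because the policy is Gaussian. A multimodal policy generally has no such formula, so its entropy has to be estimated from samples as $-\mathbb{E}_{a \sim \pi}[\log \pi(a \mid s)]$.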
### Solution
To address these problems, the authors propose SAC-NF (Soft Actor-Critic with Normalizing Flows). The main contributions are:
- **Extending SAC's Policy Distribution**: By introducing normalizing flows (NF), SAC-NF can model richer probability distributions (such as multimodal distributions), which accelerates the discovery of better policies.
- **Improving Exploration Efficiency**: The high expressive power of NF lets SAC-NF explore complex environments more effectively and avoid premature convergence to sub-optimal solutions.
- **Parameter Efficiency**: Experiments show that SAC-NF can significantly reduce the number of model parameters while matching or even surpassing the performance of SAC.
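The first contribution above can be illustrated with a minimal sketch: a Gaussian base sample is pushed through a stack of flow layers, and the change-of-variables formula keeps track of the log-density that the entropy term needs. The sketch assumes planar flows, which are one common choice of layer; this summary does not say which flow family the authors use. All function and parameter names (`planar_flow`, `sample_flow_policy`, `u`, `w`, `b`) are hypothetical, and state conditioning and action squashing are omitted.

```python
import numpy as np


def gaussian_log_prob(z, mean, log_std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    var = np.exp(2.0 * log_std)
    per_dim = -0.5 * (z - mean) ** 2 / var - log_std - 0.5 * np.log(2.0 * np.pi)
    return np.sum(per_dim, axis=-1)


def planar_flow(z, u, w, b):
    """One planar layer f(z) = z + u * tanh(w.z + b).

    Returns the transformed batch and log|det df/dz| for each sample.
    The layer is invertible when w.u >= -1 (assumed, not enforced here).
    """
    h = np.tanh(z @ w + b)                 # shape (batch,)
    out = z + np.outer(h, u)               # shape (batch, dim)
    psi = (1.0 - h ** 2)[:, None] * w      # tanh'(w.z + b) * w
    log_det = np.log(np.abs(1.0 + psi @ u))
    return out, log_det


def sample_flow_policy(mean, log_std, layers, n, rng):
    """Draw n actions and their log-probabilities under the flow policy."""
    noise = rng.standard_normal((n, mean.shape[0]))
    z = mean + np.exp(log_std) * noise
    log_prob = gaussian_log_prob(z, mean, log_std)
    for u, w, b in layers:
        z, log_det = planar_flow(z, u, w, b)
        log_prob = log_prob - log_det      # change of variables
    return z, log_prob


# Hypothetical usage: two flow layers on a 2-D action space.
demo_rng = np.random.default_rng(0)
demo_layers = [
    (np.array([0.5, 0.1]), np.array([1.0, -0.7]), 0.2),
    (np.array([-0.3, 0.4]), np.array([0.6, 0.9]), -0.1),
]
actions, log_probs = sample_flow_policy(
    np.zeros(2), np.zeros(2), demo_layers, 1000, demo_rng
)
entropy_estimate = -log_probs.mean()       # Monte Carlo estimate of the policy entropy
```

Because each layer adds only the parameters `u`, `w`, and `b`, a flow of this kind reshapes the base distribution with few extra parameters. With an empty `layers` list the sketch reduces to the plain Gaussian policy.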
### Experimental Verification
The authors evaluate SAC-NF on multiple continuous-control benchmarks, including MuJoCo and PyBullet Roboschool tasks. The results show that SAC-NF outperforms SAC in both convergence speed and final performance, and is more parameter-efficient on some tasks.
### Summary
The main objective of the paper is to extend SAC with normalizing flows in order to address the insufficient exploration of existing RL algorithms in high-dimensional continuous-control tasks, and to demonstrate experimentally the method's advantages in performance and parameter efficiency.