Abstract: The ability to discover approximately optimal policies in domains with sparse rewards is crucial to applying reinforcement learning (RL) in many real-world scenarios. Approaches such as neural density models and continuous exploration (e.g., Go-Explore) have been proposed to maintain the high exploration rate necessary to find high-performing and generalizable policies. Soft actor-critic (SAC) is another method for improving exploration that aims to combine efficient learning via off-policy updates with maximization of the policy entropy. In this work, we extend SAC to a richer class of probability distributions (e.g., multimodal) through normalizing flows (NF) and show that this significantly improves performance by accelerating the discovery of good policies while using much smaller policy representations. Our approach, which we call SAC-NF, is a simple, efficient, easy-to-implement modification and improvement to SAC on continuous control baselines such as MuJoCo and PyBullet Roboschool domains. Finally, SAC-NF does this while being significantly more parameter efficient, using as few as 5.5% of the parameters of an equivalent SAC model.

What problem does this paper attempt to address?

The problem this paper addresses is how to discover near-optimal policies in domains with sparse rewards, which is key to applying reinforcement learning (RL) in many real-world scenarios. Specifically, the authors extend the Soft Actor-Critic (SAC) algorithm with normalizing flows (NF) to improve exploration efficiency and performance.

### Specific Problem Description

1. **Insufficient Exploration**:
   - In high-dimensional continuous-control tasks, existing RL algorithms often fail to fully explore the environment and are prone to getting trapped in local optima. For example, robotic environments contain many local minima, so the learned policies are not robust enough.
2. **Limitations of SAC**:
   - SAC promotes exploration by maximizing policy entropy, but it is limited to simple distributions with closed-form entropy (such as unimodal Gaussian distributions). This restricts its exploration ability, especially in tasks that require complex, multimodal distributions.
3. **Parameter Efficiency**:
   - The proposed method should not only outperform existing methods but also be markedly more parameter efficient, that is, match or exceed their results with fewer parameters.

### Solution

To solve the above problems, the authors propose SAC-NF (Soft Actor-Critic with Normalizing Flows). The main contributions include:

- **Extending SAC's Policy Distribution**: By introducing normalizing flows (NF), SAC-NF can model richer probability distributions (such as multimodal distributions), thereby accelerating the discovery of better policies.
- **Improving Exploration Efficiency**: The high expressive power of NF enables SAC-NF to explore complex environments more effectively and avoid premature convergence to sub-optimal solutions.
- **Parameter Efficiency**: Experiments show that SAC-NF can significantly reduce the number of model parameters while matching or even surpassing the performance of SAC.

### Experimental Verification

The authors verified the effectiveness of SAC-NF on multiple benchmark tasks, including MuJoCo and PyBullet Roboschool tasks. The results show that SAC-NF is superior to SAC in both convergence speed and final performance, and exhibits higher parameter efficiency in some tasks.

### Summary

The main objective of the paper is to extend SAC with normalizing flows in order to address the insufficient exploration of existing RL algorithms in high-dimensional continuous-control tasks, and to demonstrate experimentally the advantages of this method in performance and parameter efficiency.
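
The idea of reshaping a Gaussian policy with a normalizing flow can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the planar flow `f(z) = z + u * tanh(w·z + b)` is chosen here only because its log-determinant has a simple closed form, and the names `gaussian_log_prob`, `planar_flow`, and `sample_action`, as well as all parameter values, are hypothetical. The sketch shows how the log-density of a transformed action follows from the change-of-variables rule, which is the quantity an entropy-regularized objective needs once the policy is no longer a plain Gaussian.

```python
import numpy as np


def gaussian_log_prob(z, mean, log_std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    var = np.exp(2.0 * log_std)
    per_dim = -0.5 * ((z - mean) ** 2 / var + 2.0 * log_std + np.log(2.0 * np.pi))
    return np.sum(per_dim, axis=-1)


def planar_flow(z, u, w, b):
    """Apply f(z) = z + u * tanh(w.z + b).

    Returns the transformed samples and log|det df/dz| for each sample.
    The map is invertible when w.u >= -1 (assumed, not enforced here).
    """
    h = np.tanh(z @ w + b)                  # shape (batch,)
    out = z + np.outer(h, u)                # shape (batch, dim)
    psi = (1.0 - h ** 2)[:, None] * w       # tanh'(w.z + b) * w
    log_det = np.log(np.abs(1.0 + psi @ u))
    return out, log_det


def sample_action(rng, mean, log_std, flows, n):
    """Draw n base Gaussian samples, push them through the flows,
    and track log q(a) via the change-of-variables rule."""
    z = mean + np.exp(log_std) * rng.standard_normal((n, mean.shape[0]))
    log_q = gaussian_log_prob(z, mean, log_std)
    for u, w, b in flows:
        z, log_det = planar_flow(z, u, w, b)
        log_q = log_q - log_det             # density shrinks where volume expands
    return z, log_q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mean, log_std = np.zeros(2), np.zeros(2)
    # Illustrative flow parameters with w.u = 0.65 > -1 (hypothetical values).
    flows = [(np.array([0.8, -0.3]), np.array([1.0, 0.5]), 0.1)]
    actions, log_q = sample_action(rng, mean, log_std, flows, n=1000)
    # Monte Carlo entropy estimate: H(q) is approximately -mean(log q(a)).
    print("estimated policy entropy:", -log_q.mean())
```

Because the transformed density generally has no closed-form entropy, the sketch estimates it from samples as the negative mean of `log_q`. With an empty `flows` list the function reduces to an ordinary diagonal Gaussian policy.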

Leveraging exploration in off-policy algorithms via normalizing flows

A Scalable Derivative-free Exploration Approach for Reinforcement Learning

Non-local Policy Optimization via Diversity-regularized Collaborative Exploration

Boosting Soft Actor-Critic: Emphasizing Recent Experience without Forgetting the Past

Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with On-Policy Experience

Off-Policy Actor-Critic in an Ensemble: Achieving Maximum General Entropy and Effective Environment Exploration in Deep Reinforcement Learning

Towards Interpretable Reinforcement Learning with Constrained Normalizing Flow Policies

Optimal Exploration Algorithm of Multi-Agent Reinforcement Learning Methods (Student Abstract)

FlowPG: Action-constrained Policy Gradient with Normalizing Flows

Reducing Entropy Overestimation in Soft Actor Critic Using Dual Policy Network

Maximum Entropy Reinforcement Learning via Energy-Based Normalizing Flow

Unified Policy Optimization for Continuous-action Reinforcement Learning in Non-stationary Tasks and Games

Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation

Improving exploration in policy gradient search: Application to symbolic optimization

Promoting Stochasticity for Expressive Policies Via a Simple and Efficient Regularization Method.

Careful at Estimation and Bold at Exploration

Generalizing soft actor-critic algorithms to discrete action spaces

Off-Policy Deep Reinforcement Learning with Analogous Disentangled Exploration

Exploration in Feature Space for Reinforcement Learning

Leveraging Fully Observable Policies for Learning under Partial Observability