Abstract: Reusing previously trained models is critical in deep reinforcement learning to speed up training of new agents. However, it is unclear how to acquire new skills when objectives and constraints are in conflict with previously learned skills. Moreover, when retraining, there is an intrinsic conflict between exploiting what has already been learned and exploring new skills. In soft actor-critic (SAC) methods, a temperature parameter can be dynamically adjusted to weight the action entropy and balance the explore $\times$ exploit trade-off. However, controlling a single coefficient can be challenging within the context of retraining, even more so when goals are contradictory. In this work, inspired by neuroscience research, we propose a novel approach using inhibitory networks to allow separate and adaptive state value evaluations, as well as distinct automatic entropy tuning. Ultimately, our approach allows for controlling inhibition to handle conflict between exploiting less risky, acquired behaviors and exploring novel ones to overcome more challenging tasks. We validate our method through experiments in OpenAI Gym environments.
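
The automatic temperature adjustment mentioned in the abstract can be illustrated with a minimal sketch. It assumes the temperature loss commonly used in standard SAC, $J(\alpha) = \mathbb{E}[-\alpha(\log \pi(a|s) + \bar{H})]$, where $\bar{H}$ is a target entropy. The function name `temperature_step`, the learning rate, and the sample values are illustrative assumptions and are not taken from the paper.

```python
import math


def temperature_step(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on J(alpha) = E[-alpha * (log_pi + target_entropy)].

    The temperature is parameterized as alpha = exp(log_alpha) so it stays positive.
    All names and default values here are illustrative assumptions.
    """
    alpha = math.exp(log_alpha)
    # Mean of (log_pi + target_entropy) over the sampled actions.
    mean_gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    # Chain rule: dJ/d(log_alpha) = -alpha * mean_gap.
    grad = -alpha * mean_gap
    return log_alpha - lr * grad


# Policy entropy estimate (0.5) is below the target (1.0): temperature rises.
too_greedy = temperature_step(0.0, [-0.5, -0.5], target_entropy=1.0)
# Policy entropy estimate (2.0) is above the target (1.0): temperature falls.
too_random = temperature_step(0.0, [-2.0, -2.0], target_entropy=1.0)
print(too_greedy, too_random)
```

When the sampled policy is less random than the target, the update raises the temperature and so weights entropy more heavily. When the policy is more random than the target, it lowers the temperature. This single coefficient is what the abstract describes as hard to control during retraining.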

What problem does this paper attempt to address?

This paper addresses retraining in deep reinforcement learning, in particular how to balance exploiting existing skills against exploring new ones when the agent faces new goals and constraints. Specifically:

1. **Problem Background**:
   - In deep reinforcement learning, reusing a pre-trained model can accelerate the training of new agents.
   - However, when new goals and constraints conflict with previously learned skills, acquiring new skills becomes a challenge.
   - Retraining carries an inherent tension: the agent must both use what it has already learned and explore new skills.
2. **Limitations of Existing Methods**:
   - In the Soft Actor-Critic (SAC) method, a temperature parameter is adjusted to balance exploration and exploitation. A single coefficient is hard to control during retraining, especially when goals conflict.
3. **Proposed New Method**:
   - Inspired by neuroscience research, the authors propose a method based on an inhibitory network (SAC-I), which allows separate and adaptive state value evaluations as well as distinct automatic entropy tuning.
   - Controlling inhibition lets the agent handle the conflict between exploiting acquired, low-risk behaviors and exploring new ones to overcome more challenging tasks.
4. **Validation Method**:
   - The authors validate the method through experiments in OpenAI Gym environments, in particular LunarLanderContinuous-v2 (with random bombs) and BipedalWalkerHardcore-v3.

### Main Contributions

1. **SAC-I Architecture**:
   - Developed the SAC-I architecture, which uses an inhibitory network to control multiple evaluation networks, with the goal of faster retraining.
   - Modified the SAC method, including training multiple value functions, storing episodic replay buffers, estimating different temperature parameters, and learning an inhibitory strategy when necessary.
2. **Detailed Verification**:
   - Provided detailed verification results for SAC-I in two modified OpenAI Gym environments, reporting better performance than the traditional SAC method when handling conflicting goals and complex tasks.

### Conclusion

This paper proposes SAC-I, a method that addresses the conflict between exploration and exploitation during retraining in deep reinforcement learning by introducing an inhibitory network. The authors report that it improves the speed and effectiveness of retraining.
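
The summary above mentions multiple evaluation networks, each with its own temperature, under the control of an inhibitory signal, but it does not say how the evaluations are combined. The sketch below is therefore purely illustrative and is not the authors' method: it assumes a simple convex blend of two entropy-augmented value estimates, and the names `Evaluator` and `gated_soft_value` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Evaluator:
    """Hypothetical container: one evaluator's estimate and its own temperature."""
    q_value: float  # expected action value under this evaluator
    alpha: float    # entropy temperature tuned separately for this evaluator


def gated_soft_value(acquired, novel, inhibition, entropy):
    """Blend two entropy-augmented value estimates with an inhibition signal.

    ASSUMPTION: inhibition in [0, 1] acts as a convex mixing weight.
    inhibition = 0.0 keeps only the acquired-skill evaluator;
    inhibition = 1.0 keeps only the novel-skill evaluator.
    """
    if not 0.0 <= inhibition <= 1.0:
        raise ValueError("inhibition must lie in [0, 1]")
    # Soft value: expected Q plus temperature-weighted policy entropy.
    soft_acquired = acquired.q_value + acquired.alpha * entropy
    soft_novel = novel.q_value + novel.alpha * entropy
    return (1.0 - inhibition) * soft_acquired + inhibition * soft_novel


acquired = Evaluator(q_value=1.0, alpha=0.1)  # low temperature: mostly exploit
novel = Evaluator(q_value=3.0, alpha=0.5)     # higher temperature: explore more
print(gated_soft_value(acquired, novel, inhibition=0.5, entropy=2.0))
```

Keeping a separate temperature per evaluator means the entropy bonus can stay small for the acquired behavior while remaining large for the novel one, which a single shared coefficient cannot express.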

Soft Actor-Critic with Inhibitory Networks for Faster Retraining
