Soft Actor-Critic with Inhibitory Networks for Faster Retraining

Jaime S. Ide, Daria Mićović, Michael J. Guarino, Kevin Alcedo, David Rosenbluth, Adrian P. Pope
DOI: https://doi.org/10.48550/arXiv.2202.02918
2022-02-08
Abstract: Reusing previously trained models is critical in deep reinforcement learning to speed up training of new agents. However, it is unclear how to acquire new skills when objectives and constraints are in conflict with previously learned skills. Moreover, when retraining, there is an intrinsic conflict between exploiting what has already been learned and exploring new skills. In soft actor-critic (SAC) methods, a temperature parameter can be dynamically adjusted to weight the action entropy and balance the explore $\times$ exploit trade-off. However, controlling a single coefficient can be challenging within the context of retraining, even more so when goals are contradictory. In this work, inspired by neuroscience research, we propose a novel approach using inhibitory networks to allow separate and adaptive state value evaluations, as well as distinct automatic entropy tuning. Ultimately, our approach allows for controlling inhibition to handle conflict between exploiting less risky, acquired behaviors and exploring novel ones to overcome more challenging tasks. We validate our method through experiments in OpenAI Gym environments.
Machine Learning, Artificial Intelligence, Neural and Evolutionary Computing
### What problem does this paper attempt to address?
This paper addresses retraining in deep reinforcement learning: how to balance exploiting existing skills against exploring new ones when the new goals and constraints conflict with what was learned before. Specifically:

1. **Problem Background**:
   - In deep reinforcement learning, reusing a pretrained model can accelerate the training of new agents.
   - However, when new goals and constraints conflict with previously learned skills, acquiring new skills becomes a challenge.
   - Retraining carries an inherent tension: the agent must both use what it has already learned and explore new skills.
2. **Limitations of Existing Methods**:
   - In Soft Actor-Critic (SAC) methods, a single temperature parameter is adjusted to balance exploration and exploitation. This is hard to control during retraining, and harder still when goals are contradictory.
3. **Proposed New Method**:
   - Inspired by neuroscience research, the authors propose a method based on inhibitory networks (SAC-I), which allows separate and adaptive state value evaluations as well as distinct automatic entropy tuning.
   - Controlling inhibition handles the conflict between exploiting less risky, already acquired behaviors and exploring novel ones to overcome more challenging tasks.
4. **Validation Method**:
   - The authors validate the method through experiments in OpenAI Gym environments, in particular LunarLanderContinuous-v2 (with random bombs) and BipedalWalkerHardcore-v3.

### Main Contributions

1. **SAC-I Architecture**:
   - Developed the SAC-I architecture, which uses an inhibitory network to control multiple evaluation networks and thereby speed up retraining.
   - Modified the SAC method to train multiple value functions, store episodic replay buffers, estimate separate temperature parameters, and learn an inhibitory policy when necessary.
2. **Detailed Verification**:
   - Provided detailed validation results showing the improvement of SAC-I over standard SAC in two modified OpenAI Gym environments, especially when goals conflict and tasks are complex.

### Conclusion

This paper proposes SAC-I, a variant of SAC that introduces an inhibitory network to handle the conflict between exploration and exploitation during retraining in deep reinforcement learning, with the aim of making retraining faster and more effective.
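The automatic temperature adjustment described under the limitations of existing methods can be illustrated with a minimal sketch. It assumes the temperature loss commonly used in SAC implementations, parameterized by `log_alpha`, and the usual heuristic of setting the target entropy to the negative action dimension. The function name `update_log_alpha`, the sample log-probabilities, and the second "exploratory" target entropy are hypothetical and chosen only for illustration. This is not the authors' SAC-I code; it only shows how keeping two temperatures with different entropy targets lets them move in different directions on the same batch.

```python
import numpy as np


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=0.01):
    """Take one gradient step on a common form of the SAC temperature loss.

    Loss (assumed form): J(log_alpha) = -log_alpha * mean(log_probs + target_entropy)
    Its gradient with respect to log_alpha is -mean(log_probs + target_entropy).
    """
    grad = -float(np.mean(log_probs + target_entropy))
    return log_alpha - lr * grad


# Hypothetical batch of action log-probabilities from the current policy.
# The entropy estimate is -mean(log_probs) = 1.0.
log_probs = np.array([-0.5, -1.0, -1.5])
action_dim = 2

# Temperature with the usual heuristic target, -action_dim.
# The policy entropy (1.0) is above this target, so the temperature decreases.
log_alpha_low_target = update_log_alpha(
    0.0, log_probs, target_entropy=-float(action_dim)
)

# Second, hypothetical temperature with a higher entropy target (2.0).
# The policy entropy (1.0) is below this target, so the temperature increases.
log_alpha_high_target = update_log_alpha(0.0, log_probs, target_entropy=2.0)

print(np.exp(log_alpha_low_target), np.exp(log_alpha_high_target))
```

With a single coefficient, only one of these two updates can be applied for a given batch. Maintaining distinct temperatures, as the summary describes for SAC-I, removes that constraint, although how the paper assigns and combines them is not specified here.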