Bounded Exploration with World Model Uncertainty in Soft Actor-Critic Reinforcement Learning Algorithm

Ting Qiao, Henry Williams, David Valencia, Bruce MacDonald
2024-12-09
Abstract: One of the bottlenecks preventing Deep Reinforcement Learning (DRL) algorithms from real-world applications is how to explore the environment and collect informative transitions efficiently. The present paper describes bounded exploration, a novel exploration method that integrates both 'soft' and intrinsic motivation exploration. Bounded exploration notably improved the Soft Actor-Critic algorithm's performance and its model-based extension's converging speed. It achieved the highest score in 6 out of 8 experiments. Bounded exploration presents an alternative method to introduce intrinsic motivations to exploration when the original reward function has strict meanings.
Machine Learning, Systems and Control
What problem does this paper attempt to address?
The paper addresses how a Deep Reinforcement Learning (DRL) agent can explore its environment efficiently and collect informative state transitions, with the goal of improving data efficiency in practical applications. It proposes a new exploration method, **Bounded Exploration**, which combines the advantages of "soft" exploration and intrinsic-motivation exploration to improve the performance of the Soft Actor-Critic (SAC) algorithm and the convergence speed of its model-based extension.

### Specific Background of the Problem

1. **Data Efficiency Problem**:
   - The main obstacle for Model-Free Reinforcement Learning (MFRL) algorithms on complex real-world problems is low data efficiency. Training an MFRL agent may require millions of interactions with the environment to learn control skills.
   - Mechanical parts in robots are prone to wear, so actively exploring uncertain regions while avoiding redundant exploration is crucial.
2. **Exploration Strategies**:
   - Existing exploration methods fall into two main categories: those that encourage exploration by adding a policy entropy term to the objective (soft RL), and those that use information-theoretic metrics (such as mutual information) as rewards.
   - These two categories are usually studied separately, and little work examines how to combine them.
3. **Uncertainty and Intrinsic Motivation**:
   - Recent research shows that an agent's uncertainty about future states can be estimated by quantifying the uncertainty of its world model, and that this estimate can serve as intrinsic motivation.
   - The traditional approach adds the uncertainty to the reward function to encourage exploration, but the agent may then over-exploit the exploration reward and exhibit risk-seeking behavior.
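For reference, the two families of exploration methods contrasted above are commonly written as shown below. This is a sketch in standard notation rather than the paper's own: the temperature $\alpha$, the bonus weight $\beta$, and the uncertainty measure $u(s_t, a_t)$ are symbols assumed here for illustration.

```latex
% Soft RL: maximum-entropy objective of the kind optimized by SAC
J_{\text{soft}}(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]

% Intrinsic motivation as a reward bonus: the original reward is modified
\tilde{r}(s_t, a_t) = r(s_t, a_t) + \beta \, u(s_t, a_t)
```

The second form changes the meaning of the reward signal itself, which is the source of the over-exploitation risk described in the last bullet.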
### Proposed Solution

The paper proposes a new exploration strategy, **Bounded Exploration**, which explores by favoring high-uncertainty actions within the soft-policy distribution and leaves the original reward function unchanged. The steps are as follows:

1. **Prepare Candidate Actions**:
   - Use the SAC algorithm to generate a set of candidate actions, all drawn from the soft-policy distribution.
2. **Estimate Uncertainty**:
   - Feed the state and each candidate action into an ensemble of world models and compute the uncertainty of each action. Uncertainty is measured by the variance of the models' predictions.
3. **Select Actions with High Uncertainty**:
   - Construct a probability distribution from the uncertainty measures and use it to select a high-uncertainty action to execute.

### Experimental Results

Bounded Exploration achieved the highest score in 6 of the 8 experiments, with notable gains in learning speed and data efficiency in some environments. It underperformed in some complex environments, however, which suggests that its ability to generalize across environments needs improvement.

### Summary

This paper proposes Bounded Exploration, a method that combines soft RL with intrinsic-motivation exploration to improve data efficiency and the performance of the SAC algorithm. Although it performs poorly in some complex environments, the method shows clear advantages in most of the reported experiments and has potential value for practical applications.
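The three steps of the proposed solution can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_bounded_action`, the softmax weighting with a `temperature`, the `greedy` switch, and the toy policy and world models are all hypothetical. The summary only says that uncertainty is the variance of the ensemble's predictions and that a distribution is built from it.

```python
import math
import random
from statistics import pvariance


def select_bounded_action(state, sample_policy, world_models,
                          n_candidates=10, temperature=1.0,
                          greedy=False, rng=None):
    """Choose one soft-policy candidate, favoring high ensemble disagreement."""
    rng = rng or random.Random()

    # Step 1: draw candidate actions from the stochastic (soft) policy.
    candidates = [sample_policy(state, rng) for _ in range(n_candidates)]

    # Step 2: uncertainty of each candidate = variance of the ensemble's
    # next-state predictions, averaged over state dimensions.
    uncertainty = []
    for action in candidates:
        preds = [model(state, action) for model in world_models]
        per_dim = [pvariance(values) for values in zip(*preds)]
        uncertainty.append(sum(per_dim) / len(per_dim))

    # Step 3: turn uncertainties into a probability distribution.
    # The softmax form is an assumption made for this sketch.
    top = max(uncertainty)
    weights = [math.exp((u - top) / temperature) for u in uncertainty]
    total = sum(weights)
    probs = [w / total for w in weights]

    if greedy:
        index = uncertainty.index(top)  # always take the most uncertain
    else:
        index = rng.choices(range(n_candidates), weights=probs, k=1)[0]
    return candidates[index], uncertainty, probs


# Toy stand-ins (hypothetical): a Gaussian policy over a 1-D action and three
# linear "world models" whose predictions diverge as the action grows.
def toy_policy(state, rng):
    return [rng.gauss(0.0, 1.0)]


def make_toy_model(gain):
    return lambda state, action: [s + gain * a for s, a in zip(state, action)]


toy_models = [make_toy_model(g) for g in (0.5, 1.0, 1.5)]
action, uncertainty, probs = select_bounded_action(
    [0.0], toy_policy, toy_models, rng=random.Random(0))
```

Because every candidate comes from the policy's own distribution, exploration stays bounded by what the soft policy already considers plausible, and the reward function is never touched. Setting `greedy=True` corresponds to always executing the single most uncertain candidate.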