Abstract:
Introduction: Value approximation bias is known to lead to suboptimal policies or to a catastrophic accumulation of overestimation bias, either of which prevents the agent from making the right trade-off between exploration and exploitation. Algorithms have been proposed to mitigate this problem. However, we still lack an understanding of how value bias impacts performance, as well as a method for efficient exploration that keeps updates stable. This study aims to clarify the effect of value bias and to improve reinforcement learning algorithms so as to enhance sample efficiency.
Methods: This study designs a simple episodic tabular MDP to investigate value underestimation and overestimation in actor-critic methods. It proposes a unified framework called Realistic Actor-Critic (RAC), which employs Universal Value Function Approximators (UVFA) to simultaneously learn policies with different value confidence bounds using the same neural network, each with a different under-overestimation trade-off.
Results: This study highlights that agents can over-explore low-value states because of an inflexible under-overestimation trade-off in the fixed-hyperparameter setting, which is a particular form of the exploration-exploitation dilemma. RAC performs directed exploration without over-exploration by using the upper bounds, while still avoiding overestimation by using the lower bounds. Through carefully designed experiments, this study empirically verifies that RAC achieves 10x sample efficiency and a 25% performance improvement compared to Soft Actor-Critic in the most challenging Humanoid environment. All source code is available at https://github.com/ihuhuhu/RAC.
Discussion: This research provides insights into the exploration-exploitation trade-off by studying how frequently policies visit low-value states when guided by different value confidence bounds. It also proposes a new unified framework that can be combined with current actor-critic methods to improve sample efficiency in the continuous control domain.
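
To make the idea of upper and lower value confidence bounds concrete, the following is a minimal sketch, not the RAC implementation. It assumes the bound is formed as the mean of an ensemble of critic estimates minus a coefficient times their standard deviation; the abstract does not state this form, and the names `confidence_bound`, `beta`, and `q_ensemble`, as well as the numbers, are hypothetical. A positive `beta` gives a pessimistic (lower) bound and a negative `beta` gives an optimistic (upper) bound.

```python
import numpy as np


def confidence_bound(q_ensemble, beta):
    """Return mean - beta * std over the critic axis (axis 0).

    Assumed form, for illustration only:
    beta > 0 yields a lower bound, beta < 0 an upper bound.
    """
    q = np.asarray(q_ensemble, dtype=float)
    return q.mean(axis=0) - beta * q.std(axis=0)


# Hypothetical Q-estimates: 4 critics (rows) for 3 candidate actions (columns).
q_ensemble = np.array([
    [1.0, 2.0, 0.5],
    [1.2, 1.0, 0.5],
    [0.8, 3.0, 0.5],
    [1.0, 2.0, 0.5],
])

pessimistic = confidence_bound(q_ensemble, beta=1.0)   # lower bound
optimistic = confidence_bound(q_ensemble, beta=-1.0)   # upper bound

# The action with the largest critic disagreement (column 1) looks best
# under the optimistic bound and worst under the pessimistic one.
print("lower bound:", pessimistic)
print("upper bound:", optimistic)
```

In a UVFA-style setup as described in the abstract, the policy and critics would additionally take the trade-off coefficient as an input, so that one network covers a whole family of bounds; that conditioning is omitted here for brevity.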

Addressing Function Approximation Error in Actor-Critic Methods

Actor-Critic With Synthesis Loss for Solving Approximation Biases

On the sample complexity of actor-critic method for reinforcement learning with function approximation

ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages

Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors

Compatible Gradient Approximations for Actor-Critic Algorithms

An Approximate Policy Iteration Viewpoint of Actor-Critic Algorithms

Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach

Realistic Actor-Critic: A framework for balance between value overestimation and underestimation

Non-Asymptotic Analysis for Single-Loop (Natural) Actor-Critic with Compatible Function Approximation

Parameter-Free Reduction of the Estimation Bias in Deep Reinforcement Learning for Deterministic Policy Gradients

PAC-Bayesian Soft Actor-Critic Learning

Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation

Two-Timescale Critic-Actor for Average Reward MDPs with Function Approximation

Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions

Simultaneous Double Q-learning with Conservative Advantage Learning for Actor-Critic Methods

A double Actor-Critic learning system embedding improved Monte Carlo tree search

Resilient Consensus-based Multi-agent Reinforcement Learning with Function Approximation

Doubly Robust Off-Policy Actor-Critic Algorithms for Reinforcement Learning

Efficient Continuous Control with Double Actors and Regularized Critics

Controlling Estimation Error in Reinforcement Learning via Reinforced Operation