Abstract: We revisit the estimation bias in policy gradients for the discounted episodic Markov decision process (MDP) from a Deep Reinforcement Learning (DRL) perspective. The objective is formulated theoretically as the expected return discounted over the time horizon. One of the major policy gradient biases is the state distribution shift: the state distribution used to estimate the gradients differs from the theoretical formulation in that it does not take the discount factor into account. Existing discussion of the influence of this bias in the literature has been limited to the tabular and softmax cases. In this paper, we therefore extend it to the DRL setting, where the policy is parameterized, and demonstrate theoretically how this bias can lead to suboptimal policies. We then discuss why empirically inaccurate implementations with a shifted state distribution can still be effective. We show that, despite this state distribution shift, the policy gradient estimation bias can be reduced in three ways: 1) a small learning rate; 2) an adaptive-learning-rate-based optimizer; and 3) KL regularization. Specifically, we show that a smaller learning rate, or an adaptive learning rate such as that used by the Adam and RMSProp optimizers, makes the policy optimization robust to the bias. We further draw connections between optimizers and optimization regularization to show that both KL and reverse KL regularization can significantly rectify this bias. Moreover, we provide extensive experiments on continuous control tasks to support our analysis. Our paper sheds light on how successful policy gradient (PG) algorithms optimize policies in the DRL setting, and contributes insights into the practical issues in DRL.
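The state distribution shift described in the abstract can be illustrated with a minimal sketch, assuming a REINFORCE-style estimator in which each step's score function, grad log pi(a_t | s_t), is multiplied by a scalar weight. Under the discounted objective that weight carries an extra factor gamma^t, which many implementations omit. The function names and the toy reward sequence below are hypothetical and are not taken from the paper.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def gradient_weights(rewards, gamma, use_discounted_state_distribution):
    """Per-step scalar that multiplies grad log pi(a_t | s_t).

    With use_discounted_state_distribution=True the weight is gamma^t * G_t,
    matching the discounted objective. With False the gamma^t factor is
    dropped, which is the shifted state distribution discussed above.
    """
    returns = discounted_returns(rewards, gamma)
    if use_discounted_state_distribution:
        return [(gamma ** t) * g for t, g in enumerate(returns)]
    return returns


# Hypothetical three-step episode with unit rewards.
rewards = [1.0, 1.0, 1.0]
gamma = 0.5
theoretical = gradient_weights(rewards, gamma, True)
shifted = gradient_weights(rewards, gamma, False)
print(theoretical)  # [1.75, 0.75, 0.25]
print(shifted)      # [1.75, 1.5, 1.0]
```

In this toy episode the two estimators agree at the first step and diverge afterwards: the shifted version gives later states more weight than the discounted objective assigns them.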

Reducing Estimation Bias Via Triplet-Average Deep Deterministic Policy Gradient

Parameter-Free Reduction of the Estimation Bias in Deep Reinforcement Learning for Deterministic Policy Gradients

WD3: Taming the Estimation Bias in Deep Reinforcement Learning

Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement Learning

On the Estimation Bias in Double Q-Learning

Softmax Deep Double Deterministic Policy Gradients

Alternated Greedy-Step Deterministic Policy Gradient

Network Architecture for Optimizing Deep Deterministic Policy Gradient Algorithms

Successively Pruned Q-Learning: Using Self Q-function to Reduce the Overestimation

Mitigating Estimation Errors by Twin TD-Regularized Actor and Critic for Deep Reinforcement Learning

Regularly Updated Deterministic Policy Gradient Algorithm

Exponential Moving Averaged Q-Network for DDPG

Dueling Network Architecture for Multi-Agent Deep Deterministic Policy Gradient

Swap Softmax Twin Delayed Deep Deterministic Policy Gradient

Approximate Policy Iteration With Deep Minimax Average Bellman Error Minimization

Controlling Underestimation Bias in Reinforcement Learning Via Minmax Operation

An Overestimation Reduction Method Based on the Multi-step Weighted Double Estimation Using Value-Decomposition Multi-agent Reinforcement Learning

Better Value Estimation in Q-learning-based Multi-Agent Reinforcement Learning

Stochastic Double Deep Q-Network

Controlling Estimation Error in Reinforcement Learning via Reinforced Operation

Adaptive Moving Average Q-learning