Abstract: This paper revisits the problem of controlling estimation bias in Q-learning, motivated by the fact that estimation bias is not always harmful: some environments benefit from overestimation or underestimation bias, while others suffer from them. Unlike previous coarse-grained bias control methods, this paper proposes a fine-grained bias control algorithm called Order Q-learning. It uses an order statistic of multiple independent Q-tables to control the bias and flexibly meet the bias needs of different environments: the bias moves from underestimation toward overestimation as one selects a higher-order Q-value. We derive the expected estimation bias together with its lower and upper bounds. They reveal that the expected estimation bias is inversely proportional to the number of Q-tables and proportional to the index of the order statistic function. To show the versatility of Order Q-learning, we design an adaptive parameter adjustment strategy, leading to AdaOrder (Adaptive Order) Q-learning. It adaptively selects the number of Q-tables and the index of the order statistic function based on the number of visits to each state-action pair and the average Q-value. We extend Order Q-learning and AdaOrder Q-learning to the large-scale setting with function approximation, leading to Order DQN and AdaOrder DQN, respectively. Finally, we consider two experimental settings: deep reinforcement learning experiments show that our method substantially outperforms several state-of-the-art (SOTA) baselines, and tabular MDP experiments reveal fundamental insights into why our method achieves superior performance. Our supplementary file can be found at https://1drv.ms/f/s!Atddp1iaDmL2gjv31CaGquw5WwYI.
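The order-statistic idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact update rule, so the code below is only one plausible reading: for each next-state action, take the `order`-th smallest value across the K Q-tables, then maximize over actions. The function name `order_q_target`, the array layout, and the 1-based `order` index are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def order_q_target(q_tables, reward, next_state, order, gamma=0.99):
    """Bootstrapped target built from an order statistic over K Q-tables.

    Hypothetical helper (not from the paper). Assumes q_tables has shape
    (K, n_states, n_actions) and that `order` is a 1-based index in 1..K,
    where order=1 picks the smallest estimate and order=K the largest.
    """
    k = q_tables.shape[0]
    if not 1 <= order <= k:
        raise ValueError(f"order must be in 1..{k}, got {order}")

    # Q-values of every action in the next state, one row per table: (K, n_actions)
    next_q = q_tables[:, next_state, :]

    # Sort across tables (axis 0) so row i holds the (i+1)-th smallest estimate
    # of each action.
    sorted_q = np.sort(next_q, axis=0)

    # Select the requested order statistic for each action, then act greedily.
    order_q = sorted_q[order - 1]
    return reward + gamma * order_q.max()


# Three tables, one state, two actions.
tables = np.array([
    [[1.0, 5.0]],
    [[2.0, 3.0]],
    [[4.0, 0.0]],
])
targets = [order_q_target(tables, 0.0, 0, m, gamma=1.0) for m in (1, 2, 3)]
print(targets)
```

Under this reading, `order=1` takes the minimum across tables and `order=K` the maximum, so raising `order` can only raise the target, which matches the abstract's statement that the bias moves from underestimation toward overestimation as a higher-order Q-value is selected.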
