Abstract: This paper revisits the problem of controlling estimation bias in Q-learning, motivated by the observation that estimation bias is not always harmful: some environments benefit from overestimation or underestimation bias, while others suffer from it. Unlike previous coarse-grained bias control methods, this paper proposes a fine-grained bias control algorithm called Order Q-learning. It uses an order statistic over multiple independent Q-tables to control the bias and to flexibly meet the bias requirements of different environments, i.e., the bias moves from underestimation toward overestimation as one selects a higher-order Q-value. We derive the expected estimation bias together with its lower and upper bounds. They reveal that the expected estimation bias is inversely proportional to the number of Q-tables and proportional to the index of the order statistic function. To show the versatility of Order Q-learning, we design an adaptive parameter adjustment strategy, leading to AdaOrder (Adaptive Order) Q-learning. It adaptively selects the number of Q-tables and the index of the order statistic function based on the number of visits to each state-action pair and the average Q-value. We extend Order Q-learning and AdaOrder Q-learning to the large-scale setting with function approximation, leading to Order DQN and AdaOrder DQN, respectively. Finally, we consider two experimental settings: deep reinforcement learning experiments show that our method substantially outperforms several state-of-the-art baselines, and tabular MDP experiments provide insight into why our method achieves superior performance. Our supplementary file can be found at https://1drv.ms/f/s!Atddp1iaDmL2gjv31CaGquw5WwYI.
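The order-statistic idea described in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' implementation. It assumes, by analogy with Maxmin Q-learning, that the bootstrap value for each next action is the `order_index`-th smallest Q-value across the N tables, that the greedy maximum is then taken over actions, and that one randomly chosen table is updated per step. The function names `order_q_target` and `order_q_update`, the array layout, and the random table selection are all hypothetical choices made for illustration.

```python
import numpy as np


def order_q_target(q_tables, next_state, reward, gamma, order_index, done=False):
    """Bootstrapped target built from an order statistic across Q-tables.

    q_tables    : array of shape (N, S, A), one Q-table per ensemble member.
    order_index : integer in 1..N (1 = minimum across tables, N = maximum).
    Assumed form of the target; the paper's exact rule may differ.
    """
    n_tables = q_tables.shape[0]
    if not 1 <= order_index <= n_tables:
        raise ValueError("order_index must lie in 1..N")
    # Q-values of the next state from every table: shape (N, A).
    next_q = q_tables[:, next_state, :]
    # Sort across tables (axis 0) separately for each action, then take
    # the order_index-th smallest value per action: shape (A,).
    ordered = np.sort(next_q, axis=0)[order_index - 1]
    bootstrap = 0.0 if done else gamma * ordered.max()
    return reward + bootstrap


def order_q_update(q_tables, state, action, reward, next_state,
                   alpha, gamma, order_index, rng, done=False):
    """Update one randomly chosen table toward the order-statistic target."""
    target = order_q_target(q_tables, next_state, reward, gamma,
                            order_index, done)
    k = rng.integers(q_tables.shape[0])  # assumed: uniform table choice
    q_tables[k, state, action] += alpha * (target - q_tables[k, state, action])
    return k
```

Under these assumptions, `order_index = 1` reduces to a min-over-tables target and `order_index = N` to a max-over-tables target, so raising the index shifts the target from pessimistic toward optimistic, which matches the qualitative behavior the abstract describes.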

Adaptive Order Q-learning

Adaptive Ensemble Q-learning: Minimizing Estimation Bias via Error Feedback

Gradient Q(σ, Λ): A Unified Algorithm with Function Approximation for Reinforcement Learning

Maxmin Q-learning: Controlling the Estimation Bias of Q-learning

On the Estimation Bias in Double Q-Learning

Safe Reinforcement Learning Using Finite-Horizon Gradient-based Estimation

A controlling estimation bias method: Max_Mix_Min estimator for Q-learning

Successively Pruned Q-Learning: Using Self Q-function to Reduce the Overestimation.

Ensemble Bootstrapping for Q-Learning

Suppressing Overestimation in Q-Learning through Adversarial Behaviors

WD3: Taming the Estimation Bias in Deep Reinforcement Learning

Controlling Estimation Error in Reinforcement Learning via Reinforced Operation

Adaptive pessimism via target Q-value for offline reinforcement learning

Expert Q-learning: Deep Reinforcement Learning with Coarse State Values from Offline Expert Examples

Balanced Q-learning: Combining the Influence of Optimistic and Pessimistic Targets

Deep Reinforcement Learning with Double Q-Learning

Self-correcting Q-learning.

Q-Distribution guided Q-learning for offline reinforcement learning: Uncertainty penalized Q-value via consistency model

Mimicking Human Intuition: Cognitive Belief-Driven Q-Learning

Strategically Conservative Q-Learning