Abstract: This paper revisits the problem of controlling estimation bias in Q-learning, motivated by the fact that estimation bias is not always harmful: some environments benefit from overestimation or underestimation bias, while others suffer from it. Unlike previous coarse-grained bias control methods, this paper proposes a fine-grained bias control algorithm called Order Q-learning. It uses an order statistic of multiple independent Q-tables to control the bias and flexibly meet the bias needs of different environments, i.e., the bias moves from underestimation toward overestimation as one selects a higher-order Q-value. We derive the expected estimation bias together with its lower and upper bounds. They reveal that the expected estimation bias is inversely proportional to the number of Q-tables and proportional to the index of the order statistic function. To show the versatility of Order Q-learning, we design an adaptive parameter adjustment strategy, leading to AdaOrder (Adaptive Order) Q-learning. It adaptively selects the number of Q-tables and the index of the order statistic function based on the number of visits to each state-action pair and the average Q-value. We extend Order Q-learning and AdaOrder Q-learning to the large-scale setting with function approximation, leading to Order DQN and AdaOrder DQN, respectively. Finally, we consider two experimental settings: deep reinforcement learning experiments show that our method outperforms several SOTA baselines by a large margin, and tabular MDP experiments reveal fundamental insights into why our method achieves superior performance. Our supplementary file can be found at https://1drv.ms/f/s!Atddp1iaDmL2gjv31CaGquw5WwYI.
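
The abstract does not give the exact update rule, so the following is only a minimal sketch of how an order statistic over several Q-tables could feed a bootstrapped target. It assumes the k-th smallest value across tables is taken per action and then maximized over actions; the function name `order_statistic_target`, the array layout `(M, S, A)`, and the 1-based index `k` are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np


def order_statistic_target(q_tables, next_state, reward, gamma, k):
    """Bootstrapped target built from the k-th order statistic of M Q-tables.

    q_tables: array of shape (M, S, A), one tabular Q-function per slice
              (layout is an assumption of this sketch).
    k:        1-based order index; k=1 picks the smallest estimate per
              action, k=M picks the largest.
    """
    m = q_tables.shape[0]
    if not 1 <= k <= m:
        raise ValueError("k must satisfy 1 <= k <= number of Q-tables")
    # Estimates of every table for the next state: shape (M, A).
    next_q = q_tables[:, next_state, :]
    # Sort across tables independently for each action, keep the k-th smallest.
    kth_smallest = np.sort(next_q, axis=0)[k - 1]
    # Greedy bootstrapping on the selected order statistic.
    return reward + gamma * kth_smallest.max()


# Three toy Q-tables over 2 states and 2 actions (made-up numbers).
q = np.zeros((3, 2, 2))
q[0, 1] = [1.0, 4.0]
q[1, 1] = [2.0, 0.0]
q[2, 1] = [3.0, 2.0]

targets = [order_statistic_target(q, next_state=1, reward=1.0, gamma=0.5, k=k)
           for k in (1, 2, 3)]
print(targets)
```

In this toy setup the target grows with `k`, which matches the abstract's description of the bias moving from underestimation toward overestimation as a higher-order Q-value is selected.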

Approximation Error Back-Propagation for Q-Function in Scalable Reinforcement Learning with Tree Dependence Structure

Gradient Q(σ, Λ): A Unified Algorithm with Function Approximation for Reinforcement Learning

A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning

Understanding Deep Neural Function Approximation in Reinforcement Learning via $ε$-Greedy Exploration

TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning

Branching Reinforcement Learning

A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

An Experimental Comparison Between Temporal Difference and Residual Gradient with Neural Network Approximation

Provably Efficient Q-learning with Function Approximation Via Distribution Shift Error Checking Oracle

Overcoming the Curse of Dimensionality in Reinforcement Learning Through Approximate Factorization

Approximate Policy Iteration With Deep Minimax Average Bellman Error Minimization

Improve Value Estimation of Q Function and Reshape Reward with Monte Carlo Tree Search

Conservative Q-Improvement: Reinforcement Learning for an Interpretable Decision-Tree Policy

Research on Knowledge Graph Completion Model Combining Temporal Convolutional Network and Monte Carlo Tree Search

Successively Pruned Q-Learning: Using Self Q-function to Reduce the Overestimation

Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison

Adaptive Order Q-learning

Non-stationary Reinforcement Learning under General Function Approximation

Constant Stepsize Q-learning: Distributional Convergence, Bias and Extrapolation