Abstract: Q-learning (QL), a common reinforcement learning algorithm, suffers from over-estimation bias due to the maximization term in the optimal Bellman operator. This bias may lead to sub-optimal behavior. Double-Q-learning tackles this issue by utilizing two estimators, yet results in an under-estimation bias. Similar to over-estimation in Q-learning, in certain scenarios, the under-estimation bias may degrade performance. In this work, we introduce a new bias-reduced algorithm called Ensemble Bootstrapped Q-Learning (EBQL), a natural extension of Double-Q-learning to ensembles. We analyze our method both theoretically and empirically. Theoretically, we prove that EBQL-like updates yield lower MSE when estimating the maximal mean of a set of independent random variables. Empirically, we show that there exist domains where both over- and under-estimation result in sub-optimal performance. Finally, we demonstrate the superior performance of a deep RL variant of EBQL over other deep QL algorithms for a suite of ATARI games.
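
The estimation problem named in the abstract can be written out as a short worked comparison. The notation below is an assumption made for illustration and is not taken from the paper: $X_1,\dots,X_m$ are independent random variables with means $\mu_1,\dots,\mu_m$, the target is $\mu^* = \max_a \mu_a$, $\hat\mu_a$ is the sample mean of $X_a$ over all samples, and $\hat\mu_a^{(j)}$ is the sample mean over the $j$-th of $K$ disjoint sample subsets.

```latex
% Single estimator (Q-learning-like): maximize over noisy sample means.
\hat\mu^{\mathrm{SE}} = \max_a \hat\mu_a,
\qquad
\mathbb{E}\big[\hat\mu^{\mathrm{SE}}\big] \;\ge\; \max_a \mathbb{E}[\hat\mu_a] = \mu^* .

% Double estimator (Double-Q-learning-like): select with subset 1, evaluate with subset 2.
\hat a = \arg\max_a \hat\mu_a^{(1)},
\qquad
\hat\mu^{\mathrm{DE}} = \hat\mu_{\hat a}^{(2)},
\qquad
\mathbb{E}\big[\hat\mu^{\mathrm{DE}}\big] = \mathbb{E}\big[\mu_{\hat a}\big] \;\le\; \mu^* .

% Ensemble estimator (EBQL-like): select with subset 1, average the other K-1 subsets.
\hat\mu^{\mathrm{EE}} = \frac{1}{K-1} \sum_{j=2}^{K} \hat\mu_{\hat a}^{(j)},
\qquad
\mathbb{E}\big[\hat\mu^{\mathrm{EE}}\big] = \mathbb{E}\big[\mu_{\hat a}\big] \;\le\; \mu^* .
```

Under these assumptions the first inequality follows from Jensen's inequality, and the other two hold because the evaluation subsets are independent of the subset used to pick $\hat a$. With $K = 2$ the ensemble estimator coincides with the double estimator. The abstract's claim concerns the MSE of such estimators; this sketch only shows the direction of each bias, not the MSE result itself.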

What problem does this paper attempt to address?

This paper addresses estimation bias in the Q-learning (QL) algorithm. QL has an over-estimation bias caused by the maximization term in the optimal Bellman operator, which may lead to sub-optimal behavior. Double Q-learning (DQL) tackles this by using two estimators, but it introduces an under-estimation bias, which can also degrade performance in some cases. To address both issues, the authors propose a bias-reduced algorithm, Ensemble Bootstrapped Q-Learning (EBQL), a natural extension of DQL to ensembles. EBQL estimates the maximal mean more accurately by dividing samples into multiple sets, and the paper reports lower mean-squared error (MSE) both theoretically and empirically. In addition, the experimental results show that on a suite of ATARI games, a deep RL variant of EBQL performs better than other deep QL algorithms.

### Main Contributions

1. **Analysis and Proof**: The authors analyze the problem of estimating the maximal mean of a set of independent random variables and prove that an ensemble estimator can reduce the mean-squared error (MSE). They also show that more than two ensemble members are required to obtain the minimum MSE.
2. **Propose a New Method**: Inspired by this analysis, the authors propose Ensemble Bootstrapped Q-Learning (EBQL) and show how it reduces the bias of the bootstrapped estimate.
3. **Empirical Evidence of Superiority**: The authors show that EBQL outperforms Q-learning and Double Q-learning in tabular environments and, combined with deep neural networks, on ATARI games.

### Problems Solved

- **Over-Estimation Bias**: In standard QL, the maximization in the Bellman operator tends to produce over-estimated values, leading to sub-optimal behavior.
- **Under-Estimation Bias**: Although DQL addresses over-estimation, it introduces an under-estimation bias, which may reduce performance in some scenarios.
- **Balance Between the Two**: By combining multiple estimators, EBQL better balances optimistic and pessimistic estimates across different environments, improving overall performance.

### Experimental Verification

- **Meta Chain MDP Experiment**: In the Meta Chain MDP environment, EBQL is robust to both positive and negative reward means, and its performance improves as the ensemble size increases.
- **ATARI Game Experiment**: On a suite of ATARI games, EBQL performs better than DQN and DDQN, especially in games such as Asterix and Crazy Climber.

In conclusion, the paper reduces estimation bias in Q-learning by introducing EBQL, thereby improving the performance and stability of reinforcement learning algorithms.
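
The summary above describes EBQL as extending Double Q-learning to an ensemble of estimators. A minimal tabular sketch of one such update follows. It assumes the update picks one ensemble member at random, lets that member choose the greedy next action, and bootstraps from the average of the remaining members. The function names `make_tables` and `ebql_update`, the argument layout, and the hyperparameter values are hypothetical choices for illustration, not the authors' code.

```python
import random


def make_tables(n_members, n_states, n_actions):
    """Create n_members zero-initialized Q-tables indexed as q[member][state][action]."""
    return [[[0.0] * n_actions for _ in range(n_states)] for _ in range(n_members)]


def ebql_update(q_tables, s, a, r, s_next, done, alpha, gamma, rng):
    """Apply one ensemble-bootstrapped update in place and return the updated member index.

    Assumed rule (a sketch, not the paper's reference implementation):
      1. sample a member k uniformly at random,
      2. let member k pick the greedy action in s_next,
      3. evaluate that action with the mean of all other members,
      4. move q_tables[k][s][a] toward r + gamma * that mean.
    """
    n_members = len(q_tables)
    if n_members < 2:
        raise ValueError("at least two ensemble members are needed")

    k = rng.randrange(n_members)
    next_values = q_tables[k][s_next]
    # Greedy action according to the member being updated.
    a_star = max(range(len(next_values)), key=next_values.__getitem__)

    if done:
        bootstrap = 0.0  # no bootstrapping from terminal states
    else:
        # Evaluate the chosen action with every member except k.
        others = [q_tables[j][s_next][a_star] for j in range(n_members) if j != k]
        bootstrap = sum(others) / len(others)

    target = r + gamma * bootstrap
    q_tables[k][s][a] += alpha * (target - q_tables[k][s][a])
    return k


# Usage: three members, two states, two actions.
q = make_tables(3, 2, 2)
for j in range(3):
    q[j][1][0] = float(j + 1)  # members disagree about the value of (state 1, action 0)

k = ebql_update(q, s=0, a=0, r=0.5, s_next=1, done=False,
                alpha=1.0, gamma=0.9, rng=random.Random(0))
print(k, q[k][0][0])
```

With two members this sketch reduces to a Double-Q-learning-style update, in which one table selects the action and the other evaluates it. With more members, the evaluation averages several independent estimates, which is the mechanism the summary credits for the lower error.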

Ensemble Bootstrapping for Q-Learning

On the Estimation Bias in Double Q-Learning

Gradient Q(σ, Λ): A Unified Algorithm with Function Approximation for Reinforcement Learning

Adaptive Ensemble Q-learning: Minimizing Estimation Bias via Error Feedback

Deep Reinforcement Learning with Double Q-Learning

Adaptive Order Q-learning

Self-correcting Q-learning.

Shared Learning : Enhancing Reinforcement in $Q$-Ensembles

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Addressing Maximization Bias in Reinforcement Learning with Two-Sample Testing

Expert Q-learning: Deep Reinforcement Learning with Coarse State Values from Offline Expert Examples

A controlling estimation bias method: Max_Mix_Min estimator for Q-learning

Regularized Softmax Deep Multi-Agent Q-Learning.

Double Successive Over-Relaxation Q-Learning with an Extension to Deep Reinforcement Learning

Successively Pruned Q-Learning: Using Self Q-function to Reduce the Overestimation.

Finite-Time Analysis of Simultaneous Double Q-learning

Exploiting Estimation Bias in Clipped Double Q-Learning for Continous Control Reinforcement Learning Tasks

Randomized Ensembled Double Q-Learning: Learning Fast Without a Model

Judgmentally adjusted Q-values based on Q-ensemble for offline reinforcement learning

Smart Sampling: Self-Attention and Bootstrapping for Improved Ensembled Q-Learning