Abstract: We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner. One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance. We propose a distributionally robust estimator for the maximum expected value term, which allows us to precisely control the level of estimation bias introduced. The distributionally robust estimator admits a closed-form solution such that the proposed algorithm has a computational cost per iteration comparable to Watkins' Q-learning. For the tabular case, we show that 2RA Q-learning converges to the optimal policy and analyze its asymptotic mean-squared error. Lastly, we conduct numerical experiments for various settings, which corroborate our theoretical findings and indicate that 2RA Q-learning often performs better than existing methods.

What problem does this paper attempt to address?

The paper addresses estimation bias in existing Q-learning methods, in particular the uncontrollable overestimation bias, which often leads to poor performance. It proposes a new Q-learning variant, called 2RA Q-learning, which controls the degree of estimation bias precisely by introducing a distributionally robust estimator. This lets the algorithm designer trade off overestimation against underestimation, which improves the quality of the learned policy.

### Main Contributions

1. **Computational Cost**: A tractable form of the 2RA Q-learning algorithm is proposed, and its computational cost per iteration is comparable to that of Watkins' Q-learning.
2. **Convergence**: For any choice of the parameter \( N \) and any regularization sequence \( \{\rho_n\}_{n \in \mathbb{N}} \) with \( \lim_{n \to \infty} \rho_n = 0 \), 2RA Q-learning is proved to converge asymptotically to the true Q-function.
3. **Estimation Bias Control**: The paper shows how the estimation bias in 2RA Q-learning can be controlled through the parameters \( \rho \) and \( N \). As \( N \to \infty \), the proposed estimation scheme becomes unbiased.
4. **Mean-Squared Error**: Under certain technical assumptions, the asymptotic mean-squared error of 2RA Q-learning is proved to equal that of Watkins' Q-learning, provided the learning rate is chosen to be \( N \) times that of Watkins' Q-learning.
5. **Numerical Experiments**: Experiments on synthetic MDPs and on environments from the OpenAI Gym suite corroborate the theoretical properties of 2RA Q-learning and show that it usually outperforms other Q-learning variants.

### Problem Background

In reinforcement learning, Q-learning is a widely used algorithm for learning the optimal policy of Markov decision processes (MDPs). However, standard Q-learning suffers from overestimation bias, which degrades the quality of the learned policy. To alleviate this problem, researchers have proposed variants such as Double Q-learning and Maxmin Q-learning, but each has its own advantages and disadvantages. For example, although Double Q-learning avoids overestimation bias, it introduces underestimation bias.

### Core Idea of 2RA Q-learning

2RA Q-learning controls the estimation bias precisely by introducing a distributionally robust estimator. The algorithm uses two parameters:

- \( \rho > 0 \): Quantifies the degree of robustness/regularization introduced.
- \( N \in \mathbb{N} \): The number of state-action estimates used to form the empirical average.

In this way, 2RA Q-learning controls the estimation bias while remaining computationally efficient, thereby improving the performance of the learned policy.

### Mathematical Expression

The update rule of 2RA Q-learning is as follows:

\[ \theta_{n+1}^{(i)}=\theta_n^{(i)}+\alpha_n \beta_n^{(i)}\left(b(X_n)-A_1(X_n) \theta_n^{(i)}+E_\rho(X_n, S_{n + 1}, \bar{\theta}_N^n)\right), \quad i = 1, \ldots, N, \]

where:

- \( \beta_n \) is an independent and identically distributed generalized Bernoulli (categorical) random variable taking values in \( \{1, \ldots, N\} \), with each component \( i \) having probability \( \frac{1}{N} \).
- \( \bar{\theta}_N^n=\frac{1}{
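The update rule above can be illustrated with a minimal tabular sketch in Python. It is not the paper's implementation. It assumes that \( b(X_n) \) corresponds to the observed reward, that \( A_1(X_n)\theta_n^{(i)} \) picks out the current estimate at the visited state-action pair, and that \( E_\rho \) is a discounted, regularized maximum over the averaged estimates. The names `two_ra_step` and `penalty`, and the default penalty (the per-action standard deviation across the \( N \) estimates), are hypothetical placeholders and are not the closed-form robust estimator derived in the paper.

```python
import numpy as np


def two_ra_step(thetas, s, a, r, s_next, alpha, gamma, rho, rng, penalty=None):
    """One illustrative tabular update in the spirit of 2RA Q-learning.

    thetas  : float array of shape (N, n_states, n_actions), the N Q-table estimates.
    penalty : hypothetical callable mapping an (N, n_actions) array to an
              (n_actions,) array. It stands in for the robust correction and
              is NOT the paper's closed-form estimator.
    Returns the index of the estimate that was updated.
    """
    if penalty is None:
        # Placeholder assumption: spread of the N estimates for each action.
        penalty = lambda q: q.std(axis=0)

    n_estimates = thetas.shape[0]

    # beta_n: choose one of the N estimates uniformly at random (probability 1/N each).
    i = int(rng.integers(n_estimates))

    # Empirical average of the N estimates at the next state.
    theta_bar_next = thetas[:, s_next, :].mean(axis=0)

    # Regularized values: the average minus rho times the placeholder penalty.
    robust_values = theta_bar_next - rho * penalty(thetas[:, s_next, :])

    # Temporal-difference target built from the regularized maximum.
    target = r + gamma * robust_values.max()

    # Only the selected estimate is updated.
    thetas[i, s, a] += alpha * (target - thetas[i, s, a])
    return i


# Usage: N = 2 estimates, 3 states, 2 actions, a single observed transition.
rng = np.random.default_rng(0)
thetas = np.zeros((2, 3, 2))
updated = two_ra_step(thetas, s=0, a=1, r=1.0, s_next=2,
                      alpha=0.5, gamma=0.9, rho=0.0, rng=rng)
```

With `rho=0.0` and a single estimate (`N = 1`), this sketch reduces to the standard Watkins' Q-learning update. Larger values of `rho` lower the target and push the estimate toward underestimation.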

Regularized Q-learning through Robust Averaging

Gradient Q(σ, Λ): A Unified Algorithm with Function Approximation for Reinforcement Learning

Polyak-Ruppert Averaged Q-Learning is Statistically Efficient

A Statistical Analysis of Polyak-Ruppert Averaged Q-learning

Robust $Q$-learning Algorithm for Markov Decision Processes under Wasserstein Uncertainty

Constant Stepsize Q-learning: Distributional Convergence, Bias and Extrapolation

Two-Step Q-Learning

Robust Q-Learning for finite ambiguity sets

Regularized Q-Learning with Linear Function Approximation

A Finite Sample Complexity Bound for Distributionally Robust Q-learning

Sample Complexity of Variance-reduced Distributionally Robust Q-learning

Ensemble Bootstrapping for Q-Learning

Robust Adversarial Reinforcement Learning via Bounded Rationality Curricula

Robust Reinforcement Learning with Distributional Risk-averse formulation

Improving Robustness via Risk Averse Distributional Reinforcement Learning

Model-Free Robust Average-Reward Reinforcement Learning

Expected Lenient Q-learning: a fast variant of the Lenient Q-learning algorithm for cooperative stochastic Markov games

Single-Trajectory Distributionally Robust Reinforcement Learning

Controlling Estimation Error in Reinforcement Learning via Reinforced Operation

Online Estimation and Inference for Robust Policy Evaluation in Reinforcement Learning