Abstract: We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner. One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance. We propose a distributionally robust estimator for the maximum expected value term, which allows us to precisely control the level of estimation bias introduced. The distributionally robust estimator admits a closed-form solution such that the proposed algorithm has a computational cost per iteration comparable to Watkins' Q-learning. For the tabular case, we show that 2RA Q-learning converges to the optimal policy and analyze its asymptotic mean-squared error. Lastly, we conduct numerical experiments for various settings, which corroborate our theoretical findings and indicate that 2RA Q-learning often performs better than existing methods.
What problem does this paper attempt to address?
The paper addresses estimation bias in existing Q-learning methods, in particular overestimation bias that cannot be controlled and often leads to poor performance. It proposes a new Q-learning variant, called 2RA Q-learning, which uses a distributionally robust estimator to control the level of estimation bias precisely. This lets the algorithm designer tune between overestimation and underestimation, with the aim of improving the quality of the learned policy.
### Main Contributions
1. **Computational Cost**: The paper gives a tractable form of the 2RA Q-learning algorithm whose computational cost per iteration is comparable to that of Watkins' Q-learning.
2. **Convergence**: For any choice of the parameter \( N \) and any regularization sequence \( \{\rho_n\}_{n \in \mathbb{N}} \) with \( \lim_{n \to \infty} \rho_n = 0 \), 2RA Q-learning is proved to converge asymptotically to the true Q-function.
3. **Estimation Bias Control**: The paper shows how the estimation bias of 2RA Q-learning can be controlled through the parameters \( \rho \) and \( N \), and that the proposed estimation scheme becomes unbiased as \( N \to \infty \).
4. **Mean-Squared Error**: Under certain technical assumptions, the asymptotic mean-squared error of 2RA Q-learning is proved to equal that of Watkins' Q-learning, provided the learning rate is chosen \( N \) times larger than that of Watkins' Q-learning.
5. **Numerical Experiments**: Experiments on synthetic MDPs and on environments from the OpenAI Gym suite corroborate the theoretical properties of 2RA Q-learning and indicate that it often outperforms other Q-learning variants.
### Problem Background
In reinforcement learning, Q-learning is a widely used algorithm for learning the optimal policy of a Markov decision process (MDP). Standard Q-learning, however, suffers from overestimation bias, which degrades the quality of the learned policy. Several variants have been proposed to alleviate this problem, such as Double Q-learning and Maxmin Q-learning, but each comes with its own trade-offs. For example, Double Q-learning avoids overestimation bias but introduces underestimation bias instead.
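The two biases mentioned above can be reproduced in a few lines. The following is a minimal sketch, not taken from the paper: it assumes a single state with ten actions whose true values are 0.0, 0.1, ..., 0.9 and unit-variance Gaussian noise, and it compares the plain maximum over noisy estimates (the estimator underlying Watkins' Q-learning) with a double estimator that selects the action on one sample set and evaluates it on another. The function name `estimator_biases` and all parameter values are illustrative assumptions.

```python
import random
import statistics


def estimator_biases(num_actions=10, num_samples=5, trials=5000, seed=0):
    """Return (bias of max estimator, bias of double estimator).

    Assumed toy setting: true action values are 0.0, 0.1, ..., so the
    true maximum is known and the bias can be measured directly.
    """
    rng = random.Random(seed)
    true_means = [0.1 * a for a in range(num_actions)]
    true_max = max(true_means)

    def noisy_means():
        # One sample mean per action from num_samples noisy observations.
        return [
            statistics.fmean(rng.gauss(mu, 1.0) for _ in range(num_samples))
            for mu in true_means
        ]

    single, double = [], []
    for _ in range(trials):
        est_a = noisy_means()
        est_b = noisy_means()
        # Single estimator: maximize over the same noisy estimates.
        single.append(max(est_a))
        # Double estimator: choose the action with set A, evaluate with set B.
        best = max(range(num_actions), key=lambda a: est_a[a])
        double.append(est_b[best])

    return (statistics.fmean(single) - true_max,
            statistics.fmean(double) - true_max)


if __name__ == "__main__":
    over, under = estimator_biases()
    print(f"max estimator bias:    {over:+.3f}")
    print(f"double estimator bias: {under:+.3f}")
```

In this toy setting the first bias comes out positive and the second negative, which is the trade-off that motivates an estimator whose bias can be tuned.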
### Core Idea of 2RA Q-learning
2RA Q-learning controls the estimation bias precisely by introducing a distributionally robust estimator. Specifically, the algorithm uses two parameters:
- \( \rho>0 \): Quantifies the degree of introduced robustness/regularization.
- \( N \in \mathbb{N} \): The number of state-action estimates used to form the empirical average.
With these two parameters, 2RA Q-learning can control the estimation bias while remaining computationally efficient, which in turn improves the quality of the learned policy.
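How the two parameters interact can be illustrated with a tabular sketch. This is a hedged illustration only: the paper's closed-form robust estimator is not reproduced in this summary, so `robust_average_target` below is a hypothetical stand-in that averages the \( N \) estimates and subtracts \( \rho \) times their population standard deviation. The helper names `make_tables` and `update_step` are likewise assumptions made for this example.

```python
import random
import statistics


def make_tables(n_estimates, num_states, num_actions):
    """Create N independent tabular Q-estimates, all initialized to zero."""
    return [[[0.0] * num_actions for _ in range(num_states)]
            for _ in range(n_estimates)]


def robust_average_target(tables, next_state, rho):
    """Hypothetical stand-in for the robust estimator (NOT the paper's formula).

    For each action, average the N estimates and subtract rho times their
    spread, then maximize over actions.
    """
    num_actions = len(tables[0][next_state])
    values = []
    for a in range(num_actions):
        samples = [table[next_state][a] for table in tables]
        values.append(statistics.fmean(samples) - rho * statistics.pstdev(samples))
    return max(values)


def update_step(tables, state, action, reward, next_state,
                alpha, gamma, rho, rng):
    """Update one uniformly chosen estimate toward the regularized target."""
    i = rng.randrange(len(tables))  # each estimate is picked with probability 1/N
    target = reward + gamma * robust_average_target(tables, next_state, rho)
    tables[i][state][action] += alpha * (target - tables[i][state][action])
    return i
```

Calling `update_step(make_tables(4, 2, 2), 0, 1, 1.0, 1, alpha=0.5, gamma=0.9, rho=0.1, rng=random.Random(0))` changes exactly one of the four tables. With a single table and `rho=0`, this sketch reduces to the usual Watkins update, which is consistent with the claim that the per-iteration cost stays comparable.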
### Mathematical Expression
The update rule of 2RA Q-learning is as follows:
\[ \theta_{n+1}^{(i)}=\theta_n^{(i)}+\alpha_n \beta_n^{(i)}\left(b(X_n)-A_1(X_n) \theta_n^{(i)}+E_\rho(X_n, S_{n + 1}, \bar{\theta}_N^n)\right), \quad i = 1, \ldots, N, \]
where:
- \( \beta_n \) is an independent and identically distributed generalized Bernoulli (categorical) random variable that selects one of the components \( \{1, \ldots, N\} \), each with probability \( \frac{1}{N} \), so that only the selected component \( i \) is updated at step \( n \).
- \( \bar{\theta}_N^n=\frac{1}{