Regularized Q-learning through Robust Averaging

Peter Schmitt-Förster, Tobias Sutter
2024-05-29
Abstract: We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner. One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance. We propose a distributionally robust estimator for the maximum expected value term, which allows us to precisely control the level of estimation bias introduced. The distributionally robust estimator admits a closed-form solution such that the proposed algorithm has a computational cost per iteration comparable to Watkins' Q-learning. For the tabular case, we show that 2RA Q-learning converges to the optimal policy and analyze its asymptotic mean-squared error. Lastly, we conduct numerical experiments for various settings, which corroborate our theoretical findings and indicate that 2RA Q-learning often performs better than existing methods.
Optimization and Control, Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is the estimation bias in existing Q-learning methods, especially the uncontrollable over-estimation bias, which often leads to poor performance. Specifically, the paper proposes a new Q-learning variant, called 2RA Q-learning, which precisely controls the degree of estimation bias by introducing a distributionally robust estimator. This allows over-estimation and under-estimation biases to be adjusted flexibly in algorithm design, thereby improving the quality and performance of the learned policy.

### Main Contributions

1. **Computational Cost**: A tractable form of the 2RA Q-learning algorithm is proposed, with a computational cost per iteration comparable to that of Watkins' Q-learning.
2. **Convergence**: For any choice of the parameter \( N \) and any regularization parameter sequence \( \{\rho_n\}_{n \in \mathbb{N}} \) with \( \lim_{n \to \infty} \rho_n = 0 \), it is proved that 2RA Q-learning asymptotically converges to the true Q-function.
3. **Estimation Bias Control**: It is shown how to control the estimation bias in 2RA Q-learning by selecting the parameters \( \rho \) and \( N \); as \( N \to \infty \), the proposed estimation scheme becomes unbiased.
4. **Mean-Squared Error**: Under certain technical assumptions, it is proved that the asymptotic mean-squared error of 2RA Q-learning equals that of Watkins' Q-learning, provided that the selected learning rate is \( N \) times that of Watkins' Q-learning.
5. **Numerical Experiments**: Experiments on synthetic MDPs and in the OpenAI Gym suite corroborate the theoretical properties of 2RA Q-learning and show that it performs well in practice, usually outperforming other Q-learning variants.

### Problem Background

In reinforcement learning, Q-learning is a widely used algorithm for learning the optimal policy of Markov decision processes (MDPs).
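
For reference, a minimal sketch of one tabular update in the style of Watkins' Q-learning is given here. The function name `watkins_q_update`, the list-of-lists table layout, and the toy values are assumptions made for illustration and do not come from the paper.

```python
def watkins_q_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD error.

    q is a list of lists indexed as q[state][action] (an assumed layout).
    """
    # Bootstrap target: reward plus the discounted maximum over next-state values.
    # Taking a max over noisy estimates is where estimation bias enters.
    target = reward + gamma * max(q[next_state])
    td_error = target - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


# Toy table with 2 states and 2 actions (hypothetical values).
q = [[0.0, 0.0], [1.0, 2.0]]
td = watkins_q_update(q, state=0, action=1, reward=1.0,
                      next_state=1, alpha=0.5, gamma=0.5)
print(q[0][1], td)
```

The `max` in the target is the term for which the paper proposes a distributionally robust estimator.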
However, standard Q-learning suffers from over-estimation bias, which degrades the quality of the learned policy. To alleviate this problem, researchers have proposed various improvements, such as Double Q-learning and Maxmin Q-learning, each with its own advantages and disadvantages. For example, although Double Q-learning avoids over-estimation bias, it introduces under-estimation bias.

### Core Idea of 2RA Q-learning

2RA Q-learning precisely controls the estimation bias by introducing a distributionally robust estimator. Specifically, the algorithm uses two parameters:

- \( \rho > 0 \): quantifies the degree of robustness/regularization introduced.
- \( N \in \mathbb{N} \): the number of state-action estimates used to form the empirical average.

In this way, 2RA Q-learning can effectively control the estimation bias while maintaining computational efficiency, thereby improving the performance of the learned policy.

### Mathematical Expression

The update rule of 2RA Q-learning is as follows:

\[ \theta_{n+1}^{(i)} = \theta_n^{(i)} + \alpha_n \beta_n^{(i)} \left( b(X_n) - A_1(X_n) \theta_n^{(i)} + E_\rho(X_n, S_{n+1}, \bar{\theta}_N^n) \right), \quad i = 1, \ldots, N, \]

where:

- \( \beta_n \) is an independent and identically distributed generalized Bernoulli random variable taking values in \( \{1, \ldots, N\} \), with each component \( i \) having probability \( \frac{1}{N} \).
- \( \bar{\theta}_N^n=\frac{1}{