A controlling estimation bias method: Max_Mix_Min estimator for Q-learning
Patigül Abliz
DOI: https://doi.org/10.1007/s11227-024-06181-y
IF: 3.3
2024-05-28
The Journal of Supercomputing
Abstract: Although Q-learning (QL) is widely used in reinforcement learning, it suffers from overestimation (maximization) bias, which can lead to poor performance in stochastic environments. Various bias correction mechanisms have been proposed to address this problem. However, while these mechanisms may reduce overestimation bias, some of them introduce underestimation bias, which is undesirable in some environments. To leverage both overestimation and underestimation biases, we introduce an underestimation mechanism called the min estimator, followed by our proposed Max_Mix_Min Q-learning (M3QL) method, which incorporates a balance parameter. Our method also considers the number N of Q-functions. First, we theoretically analyze why our method benefits from both overestimation and underestimation bias under the assumption of different bias distributions, and how the sample size N affects its performance. We also visualize the results of the theoretical analysis on a Meta-chain MDP example. The theoretical analysis demonstrates that M3QL reduces bias compared to QL and the underestimation mechanism. Furthermore, we prove that M3QL is unbiased for certain values of the balance parameter. In experimental comparisons with state-of-the-art methods on Atari benchmark problems, our method consistently outperforms them. We also compare M3QL with the underestimation mechanism and Deep Q-Network (DQN) on these benchmark problems, showing that M3QL improves on the performance of both on most of the benchmark problems.
computer science, theory & methods; engineering, electrical & electronic; hardware & architecture
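The opposing biases described in the abstract can be illustrated with a small Monte Carlo sketch. It is not the paper's M3QL update, which the abstract does not spell out. It assumes that every true action value is zero, that each of the N Q-functions sees independent Gaussian noise, and that the two estimators are blended by a convex combination with a hypothetical weight `beta`. The function name `simulate_bias` and all of its parameters are illustrative.

```python
import numpy as np


def simulate_bias(n_actions=3, n_estimators=4, beta=0.5,
                  noise_std=1.0, n_trials=20000, seed=0):
    """Estimate the bias of max, min-based, and mixed value targets.

    Assumption: every true action value is 0, so the true maximum is 0
    and the mean of each estimator is its bias.
    """
    rng = np.random.default_rng(seed)
    # Noisy Q-values with shape (trials, estimators, actions).
    q = rng.normal(0.0, noise_std, size=(n_trials, n_estimators, n_actions))

    # Max estimator: maximum over actions of a single noisy Q-function.
    max_est = q[:, 0, :].max(axis=1)

    # Min-based estimator: minimum over the N Q-functions for each action,
    # then the maximum over actions.
    min_est = q.min(axis=1).max(axis=1)

    # Hypothetical mix: convex combination weighted by beta (an assumption,
    # not the formula from the paper).
    mix_est = beta * max_est + (1.0 - beta) * min_est

    return max_est.mean(), min_est.mean(), mix_est.mean()


if __name__ == "__main__":
    max_bias, min_bias, mix_bias = simulate_bias()
    print(f"max estimator bias: {max_bias:+.3f}")
    print(f"min estimator bias: {min_bias:+.3f}")
    print(f"mixed estimator bias: {mix_bias:+.3f}")
```

Under these assumptions the max estimator comes out positively biased and the min-based estimator negatively biased, so a weighted blend lies between the two. Varying `beta` and `n_estimators` in this toy setting shows how a balance weight and the number of Q-functions shift the resulting bias.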