Successively Pruned Q-Learning: Using Self Q-function to Reduce the Overestimation.

Zhaolin Xue, Lihua Zhang, Zhiyan Dong
DOI: https://doi.org/10.5555/3635637.3663062
2024-01-01
Abstract: It is well known that the Q-learning algorithm suffers from overestimation because it uses the maximum estimated state-action value as an approximation of the maximum expected state-action value. Double Q-learning and other algorithms have been proposed as efficient solutions to alleviate the overestimation. However, these methods rely on multiple Q-functions to reduce the overestimation and ignore the information contained in a single Q-function. In this paper, 1) we reinterpret the update process of Q-learning and build a more precise model that is compatible with the previous model. 2) We propose a novel and simple method to control the maximum bias by employing the information of a single Q-function. 3) Our method not only balances overestimation and underestimation, but also attains the minimum bias under proper hyper-parameters. 4) Moreover, it generalizes naturally to both discrete control domains and continuous control tasks. We show that our algorithms outperform Double DQN and other algorithms on some representative games, and that some classical off-policy actor-critic algorithms can also benefit from our method.
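The overestimation described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's method: it only contrasts the single-estimator maximum used by Q-learning with a two-estimator target in the style of Double Q-learning. The function name `estimator_bias`, the linearly spaced true action values, the unit-variance Gaussian noise, and all sample sizes are assumptions chosen for illustration.

```python
import numpy as np

def estimator_bias(n_actions=10, n_samples=5, n_trials=20000, seed=0):
    """Return (single_bias, double_bias) relative to the true maximum value.

    Hypothetical setup: one state, true action values spaced in [0, 0.5],
    each value estimated by a sample mean of noisy Gaussian returns.
    """
    rng = np.random.default_rng(seed)
    true_q = np.linspace(0.0, 0.5, n_actions)  # assumed true values
    true_max = true_q.max()

    # Two independent sets of sample-mean estimates, one per "Q-function".
    shape = (n_trials, n_actions, n_samples)
    loc = true_q[None, :, None]
    q_a = rng.normal(loc, 1.0, size=shape).mean(axis=2)
    q_b = rng.normal(loc, 1.0, size=shape).mean(axis=2)

    # Single estimator: maximum of the noisy estimates (Q-learning target).
    single = q_a.max(axis=1)

    # Double estimator: pick the action with q_a, evaluate it with q_b.
    best = q_a.argmax(axis=1)
    double = q_b[np.arange(n_trials), best]

    return single.mean() - true_max, double.mean() - true_max

single_bias, double_bias = estimator_bias()
print(f"single-estimator bias: {single_bias:+.3f}")
print(f"double-estimator bias: {double_bias:+.3f}")
```

Under these assumptions the single estimator comes out above the true maximum and the double estimator below it, which is the overestimation and underestimation trade-off the abstract refers to.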