Adapting Double Q-Learning for Continuous Reinforcement Learning

Arsenii Kuznetsov
2023-09-26
Abstract: The majority of off-policy reinforcement learning algorithms use overestimation bias control techniques. Most of these techniques are rooted in heuristics, primarily addressing the consequences of overestimation rather than its fundamental origins. In this work we present a novel approach to bias correction, similar in spirit to Double Q-Learning. We propose using a policy in the form of a mixture with two components. Each policy component is maximized and assessed by separate networks, which removes any basis for the overestimation bias. Our approach shows promising near-SOTA results on a small set of MuJoCo environments.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses overestimation bias in off-policy reinforcement learning algorithms for continuous action spaces. Off-policy algorithms tend to overestimate the state-action value function (Q-function), so suboptimal actions can receive inflated value estimates, which hurts both training efficiency and final performance. Inspired by Double Q-Learning, the paper proposes representing the policy as a mixture of two components, each optimized against one network and evaluated by a different one. Because no network both selects and scores the same action, the usual source of the bias is removed. The idea parallels Double Deep Q-Learning (DDQN) for discrete action spaces, carried over to continuous ones.

Experiments on a small set of MuJoCo environments show that the algorithm reduces overestimation bias and improves performance relative to training with no bias control technique. The method also requires no environment-specific hyperparameter grid search, which avoids spending environment samples on tuning. However, these preliminary results also show that the method still trails fine-tuned existing methods in some cases. Further refinement of the algorithm design is therefore likely needed before it fully eliminates overestimation bias in practice.
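The source of the bias, and the reason that separating selection from evaluation removes it, can be illustrated with a small numerical sketch. It is not the paper's implementation. The first function is the standard single-versus-double estimator comparison on a toy problem where every true action value is zero. The second function, `cross_evaluated_value`, is a hypothetical reading of the mixture idea: it assumes that `actors[i]` has been trained to maximize `critics[i]` and that the two components are sampled with equal probability, neither of which is confirmed by the summary above.

```python
import numpy as np


def estimator_bias(n_actions=10, n_trials=20000, noise_std=1.0, seed=0):
    """Average single- and double-estimator values on a toy problem.

    Every true action value is 0, so the true maximum is 0 and any
    nonzero average returned here is pure estimation bias.
    """
    rng = np.random.default_rng(seed)
    # Two independent noisy estimates of the same all-zero Q-values.
    q_a = rng.normal(0.0, noise_std, size=(n_trials, n_actions))
    q_b = rng.normal(0.0, noise_std, size=(n_trials, n_actions))

    # Single estimator: the same noisy table selects and scores the action.
    single = q_a.max(axis=1)

    # Double estimator: q_a selects the action, q_b scores it.
    greedy = q_a.argmax(axis=1)
    double = q_b[np.arange(n_trials), greedy]

    return single.mean(), double.mean()


def cross_evaluated_value(next_state, actors, critics, rng):
    """Hypothetical target value for a two-component mixture policy.

    Assumption: actors[i] is optimized against critics[i], so its action
    is scored by the other critic, critics[1 - i].
    """
    i = int(rng.integers(2))  # equal mixture weights (assumption)
    action = actors[i](next_state)
    return critics[1 - i](next_state, action)


if __name__ == "__main__":
    single_mean, double_mean = estimator_bias()
    print(f"single estimator mean: {single_mean:.3f}")
    print(f"double estimator mean: {double_mean:.3f}")
```

In the toy problem the single estimator's average is clearly positive even though every true value is zero, while the double estimator's average stays close to zero. In a continuous action space there is no explicit argmax over actions; the policy plays the role of the action selector, which is why the sketch pairs each policy component with a critic other than the one it is optimized against.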