Adapting Double Q-Learning for Continuous Reinforcement Learning

Arsenii Kuznetsov

2023-09-26

Abstract: The majority of off-policy reinforcement learning algorithms use overestimation bias control techniques. Most of these techniques are rooted in heuristics and primarily address the consequences of overestimation rather than its fundamental origins. In this work we present a novel approach to bias correction, similar in spirit to Double Q-Learning. We propose using a policy in the form of a mixture with two components. Each policy component is maximized and assessed by separate networks, which removes any basis for the overestimation bias. Our approach shows promising near-SOTA results on a small set of MuJoCo environments.

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses overestimation bias in off-policy algorithms for continuous-action reinforcement learning. Off-policy algorithms tend to overestimate the state-action value function (Q-function), so suboptimal actions can receive inflated value estimates, which harms both training efficiency and final performance. Inspired by Double Q-Learning, the paper proposes correcting this bias at its source: the policy is represented as a mixture of two components, and each component is optimized against one network and evaluated by a different one. The approach parallels Double Deep Q-Learning (DDQN) for discrete action spaces but applies to continuous action spaces.

Experiments on a small set of MuJoCo environments show that the algorithm reduces overestimation bias and improves performance relative to training with no bias correction. The method also requires no environment-specific hyperparameter grid search, which the summary counts as a significant advantage that improves the algorithm's sample efficiency. The results are preliminary, however: although the method shows potential for reducing overestimation bias, in some cases it still underperforms existing methods that have been fine-tuned. Further refinement of the algorithm design may be needed to eliminate overestimation bias completely.
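The source of the bias, and the reason that separating selection from evaluation removes it, can be seen in a toy setting. The following is a minimal sketch in plain Python, not the paper's algorithm: it assumes a single state with a handful of discrete actions whose true values are all zero, and two independent noisy value estimates. The function name `estimate_bias` and all of its parameters are hypothetical and chosen only for illustration.

```python
import random
import statistics


def estimate_bias(num_actions=10, noise_std=1.0, trials=20000, seed=0):
    """Average target value under two schemes when every true action value is 0.

    Toy illustration only (assumed setup): one state, discrete actions,
    independent Gaussian noise on each value estimate.
    """
    rng = random.Random(seed)
    single, double = [], []
    for _ in range(trials):
        # Two independent noisy estimates of the same all-zero true values.
        q1 = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        q2 = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]

        # Single estimator: the same estimate both selects and evaluates the
        # action, so the maximum picks up whichever noise term is largest.
        single.append(max(q1))

        # Double estimator: q1 selects the action, q2 evaluates it. The noise
        # in q2 is independent of the selection, so it averages out.
        best = max(range(num_actions), key=q1.__getitem__)
        double.append(q2[best])
    return statistics.fmean(single), statistics.fmean(double)


single_bias, double_bias = estimate_bias()
print(f"single-estimator bias: {single_bias:+.3f}")
print(f"double-estimator bias: {double_bias:+.3f}")
```

Because the true values are all zero, any nonzero average is pure estimation bias. The single-estimator average comes out clearly positive, while the double-estimator average stays close to zero. The method summarized above aims for the same decoupling in continuous action spaces, where the maximization is carried out by a policy component rather than by an explicit argmax over actions.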

Exploiting Estimation Bias in Clipped Double Q-Learning for Continous Control Reinforcement Learning Tasks

Deep Reinforcement Learning with Double Q-Learning

Simultaneous Double Q-learning with Conservative Advantage Learning for Actor-Critic Methods

On the Estimation Bias in Double Q-Learning

Self-correcting Q-learning

Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach

Successively Pruned Q-Learning: Using Self Q-function to Reduce the Overestimation

Efficient Continuous Control with Double Actors and Regularized Critics

Decorrelated Double Q-learning

Careful at Estimation and Bold at Exploration

Adaptive Order Q-learning

Randomized Ensembled Double Q-Learning: Learning Fast Without a Model

Unified Policy Optimization for Continuous-action Reinforcement Learning in Non-stationary Tasks and Games

Softmax Deep Double Deterministic Policy Gradients

Swap Softmax Twin Delayed Deep Deterministic Policy Gradient

Better Value Estimation in Q-learning-based Multi-Agent Reinforcement Learning

Q-learning with biased policy rules

Adaptive Ensemble Q-learning: Minimizing Estimation Bias via Error Feedback

Multiagent Soft Q-Learning

Accurate Q-Learning