Abstract: Exploration strategies in continuous action spaces are often heuristic because the action set is infinite, and such methods do not yield general conclusions. Prior work has shown that policy-based exploration benefits continuous action spaces in deterministic policy reinforcement learning (DPRL). However, policy-based exploration in DPRL has two prominent issues, aimless exploration and policy divergence, and the policy gradient used for exploration is only sometimes helpful because it is estimated inaccurately. Building on the double-Q function framework, we introduce a novel exploration strategy, separate from the policy gradient, to mitigate these issues. We first propose the greedy Q softmax update schema for the Q value update. The expected Q value is derived as a weighted sum of the conservative Q value over actions, where each weight is given by the corresponding greedy Q value. Greedy Q takes the maximum of the two Q functions, and conservative Q takes the minimum of the two Q functions. For practicality, this theoretical basis is then extended to combine action exploration with the Q value update, on the premise that we have a surrogate policy that behaves like this exploration policy. In practice, we construct such an exploration policy from a few sampled actions, and to meet the premise, we learn the surrogate policy by minimizing the KL divergence between the target policy and the exploration policy constructed by the conservative Q. We evaluate our method on the MuJoCo benchmark and demonstrate superior performance compared to previous state-of-the-art methods across various environments, particularly in the most complex Humanoid environment.
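The weighted-sum step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `greedy_q_softmax_value`, the `temperature` parameter, and the choice to turn greedy Q values into weights with a softmax are assumptions made here for illustration, based only on the name "greedy Q softmax" and the description of the two Q functions.

```python
import math


def greedy_q_softmax_value(q1, q2, temperature=1.0):
    """Weighted sum of conservative Q values over a few sampled actions.

    q1, q2: estimates from the two Q functions, one entry per sampled action.
    temperature: hypothetical softmax temperature (an assumption, not from the source).
    Returns (expected_q, weights).
    """
    # Greedy Q: element-wise maximum of the two Q functions.
    greedy = [max(a, b) for a, b in zip(q1, q2)]
    # Conservative Q: element-wise minimum of the two Q functions.
    conservative = [min(a, b) for a, b in zip(q1, q2)]

    # Softmax over greedy Q (assumed normalization); subtract the max for stability.
    top = max(greedy)
    exps = [math.exp((g - top) / temperature) for g in greedy]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Expected Q value: conservative Q weighted by the greedy-Q-derived weights.
    expected_q = sum(w * c for w, c in zip(weights, conservative))
    return expected_q, weights


if __name__ == "__main__":
    # Two sampled actions with disagreeing Q estimates.
    value, weights = greedy_q_softmax_value([1.0, 5.0], [2.0, 3.0])
    print(value, weights)
```

Under these assumptions, actions that either Q function rates highly receive more weight (the "bold" part), while the value that is actually backed up comes from the lower of the two estimates (the "careful" part).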

A Memory-Greedy Policy with Guaranteed Convergence for Accelerating Reinforcement Learning

Greedy exploration policy of Q-learning based on state balance

Leveraging Efficiency Through Hybrid Prioritized Experience Replay in Door Environment

Multi-Agent Deep Deterministic Policy Gradient Algorithm Based on Classification Experience Replay

A Dynamic Adjusting Reward Function Method for Deep Reinforcement Learning with Adjustable Parameters

Q-Pensieve: Boosting Sample Efficiency of Multi-Objective RL Through Memory Sharing of Q-Snapshots

Dueling Network Architecture for Multi-Agent Deep Deterministic Policy Gradient

AdaMemento: Adaptive Memory-Assisted Policy Optimization for Reinforcement Learning

Efficient use of heuristics for accelerating XCS-based policy learning in Markov games

A Novel Policy Based on Action Confidence Limit to Improve Exploration Efficiency in Reinforcement Learning

Multiple Suboptimal Policies Integrated Reinforcement Learning Algorithm for Path Planning

Deterministic policy optimization with clipped value expansion and long-horizon planning

Global Convergence of Policy Gradient Methods in Reinforcement Learning, Games and Control

Z-Score Experience Replay in Off-Policy Deep Reinforcement Learning

Replay Memory as An Empirical MDP: Combining Conservative Estimation with Experience Replay

Careful at Estimation and Bold at Exploration

Approximate Policy-Based Accelerated Deep Reinforcement Learning

Policy Augmentation: An Exploration Strategy for Faster Convergence of Deep Reinforcement Learning Algorithms

Trajectory-Oriented Policy Optimization with Sparse Rewards

Near-optimal Conservative Exploration in Reinforcement Learning under Episode-wise Constraints

Improving Policy Generalization for Teacher-Student Reinforcement Learning