Abstract: Exploration strategies in continuous action spaces are often heuristic because the action set is infinite, and such methods do not yield general conclusions. Prior work has shown that policy-based exploration benefits continuous action spaces in deterministic policy reinforcement learning (DPRL). However, policy-based exploration in DPRL has two prominent issues, aimless exploration and policy divergence, and the policy gradient used for exploration is only sometimes helpful because of inaccurate estimation. Building on the double-Q function framework, we introduce a novel exploration strategy, separate from the policy gradient, to mitigate these issues. We first propose the greedy Q softmax update schema for the Q value update. The expected Q value is derived as a weighted sum of the conservative Q value over actions, where the weight is the corresponding greedy Q value. The greedy Q takes the maximum of the two Q functions, and the conservative Q takes the minimum of the two. For practicality, this theoretical basis is then extended to combine action exploration with the Q value update, under the premise that we have a surrogate policy that behaves like this exploration policy. In practice, we construct such an exploration policy from a few sampled actions and, to meet the premise, learn a surrogate policy by minimizing the KL divergence between the target policy and the exploration policy constructed by the conservative Q. We evaluate our method on the MuJoCo benchmark and demonstrate superior performance compared to previous state-of-the-art methods across various environments, particularly in the most complex Humanoid environment.
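The weighted sum described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `greedy_q_softmax_value`, the `temperature` parameter, and the choice to normalize the greedy Q values with a softmax (suggested by the name "greedy Q softmax") are assumptions made for illustration. The inputs are the two critics' estimates for a handful of sampled actions at a single state.

```python
import numpy as np

def greedy_q_softmax_value(q1, q2, temperature=1.0):
    """Hypothetical sketch: expected Q value over a few sampled actions.

    q1, q2: the two Q-function estimates for the same sampled actions.
    temperature: assumed softmax temperature (not stated in the abstract).
    """
    q1 = np.asarray(q1, dtype=float)
    q2 = np.asarray(q2, dtype=float)

    greedy = np.maximum(q1, q2)        # greedy Q: element-wise maximum
    conservative = np.minimum(q1, q2)  # conservative Q: element-wise minimum

    # Softmax over the greedy Q values (assumed normalization),
    # shifted by the maximum for numerical stability.
    logits = (greedy - greedy.max()) / temperature
    weights = np.exp(logits)
    weights /= weights.sum()

    # Expected Q: conservative values weighted by the greedy-Q softmax.
    return float(np.dot(weights, conservative)), weights

# Example with three sampled actions (made-up numbers).
value, weights = greedy_q_softmax_value([1.0, 2.5, 0.3], [1.2, 2.0, 0.5])
print(value, weights)
```

Under these assumptions, actions that look promising to either critic receive more weight, while the value that is actually backed up stays pessimistic, which is one way to read the pairing of greedy weights with conservative values.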

A Study of Count-Based Exploration and Bonus for Reinforcement Learning

Optimal Exploration Algorithm of Multi-Agent Reinforcement Learning Methods (Student Abstract)

Exploration in Feature Space for Reinforcement Learning

Unifying Count-Based Exploration and Intrinsic Motivation

CMBE: Curiosity-driven Model-Based Exploration for Multi-Agent Reinforcement Learning in Sparse Reward Settings

BeBold: Exploration Beyond the Boundary of Explored Regions

Careful at Estimation and Bold at Exploration

Flipping Coins to Estimate Pseudocounts for Exploration in Reinforcement Learning

Dynamic Subgoal-based Exploration via Bayesian Optimization

Subspace-Aware Exploration for Sparse-Reward Multi-Agent Tasks

MADE: Exploration via Maximizing Deviation from Explored Regions

VDSC: Enhancing Exploration Timing with Value Discrepancy and State Counts

Backtracking Exploration for Reinforcement Learning

Two Heads Are Better Than One: A Simple Exploration Framework for Efficient Multi-Agent Reinforcement Learning

MESA: Cooperative Meta-Exploration in Multi-Agent Learning through Exploiting State-Action Space Structure

Approximate Exploration through State Abstraction

Deterministic Exploration via Stationary Bellman Error Maximization

Success Probability of Exploration: a Concrete Analysis of Learning Efficiency

Bounded Exploration with World Model Uncertainty in Soft Actor-Critic Reinforcement Learning Algorithm

Discovering and Exploiting Sparse Rewards in a Learned Behavior Space

Learning to explore by reinforcement over high-level options