Abstract: Exploration strategies in continuous action spaces are often heuristic because the action set is infinite, and such methods do not yield general conclusions. Prior work has shown that policy-based exploration benefits continuous action spaces in deterministic policy reinforcement learning (DPRL). However, policy-based exploration in DPRL has two prominent issues, aimless exploration and policy divergence, and the policy gradient used for exploration is only sometimes helpful because it is estimated inaccurately. Building on the double-Q function framework, we introduce a novel exploration strategy, separate from the policy gradient, to mitigate these issues. We first propose the greedy Q softmax update schema for the Q value update. The expected Q value is a weighted sum of the conservative Q value over actions, with weights given by the corresponding greedy Q value. The greedy Q takes the maximum of the two Q functions, and the conservative Q takes their minimum. For practicality, we then extend this theoretical basis to combine action exploration with the Q value update, under the premise that we have a surrogate policy that behaves like this exploration policy. In practice, we construct the exploration policy from a few sampled actions and, to meet the premise, learn the surrogate policy by minimizing the KL divergence between the target policy and the exploration policy constructed by the conservative Q. We evaluate our method on the MuJoCo benchmark and demonstrate superior performance compared to previous state-of-the-art methods across various environments, particularly in the most complex Humanoid environment.
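The weighted sum described in the abstract can be illustrated with a small sketch over a handful of sampled actions. This is not the paper's implementation: the function name `greedy_q_softmax_value`, the `temperature` parameter, and the choice to normalize the greedy Q values with a softmax are assumptions made here for illustration, since the abstract only states that the weights come from the greedy Q value.

```python
import math


def greedy_q_softmax_value(q1, q2, temperature=1.0):
    """Weighted sum of conservative Q values, weighted by a softmax over greedy Q.

    q1, q2: Q estimates from the two Q functions for the same sampled actions.
    temperature: hypothetical softmax temperature (an assumption, not from the source).
    """
    # Greedy Q: element-wise maximum of the two Q functions.
    greedy = [max(a, b) for a, b in zip(q1, q2)]
    # Conservative Q: element-wise minimum of the two Q functions.
    conservative = [min(a, b) for a, b in zip(q1, q2)]

    # Softmax over greedy Q, shifted by the maximum for numerical stability.
    peak = max(greedy)
    exps = [math.exp((g - peak) / temperature) for g in greedy]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Expected Q value: conservative Q averaged under the greedy-Q weights.
    value = sum(w * c for w, c in zip(weights, conservative))
    return value, weights


if __name__ == "__main__":
    # Made-up Q estimates for three sampled actions.
    q1 = [1.0, 2.0, 0.5]
    q2 = [0.8, 2.5, 0.7]
    value, weights = greedy_q_softmax_value(q1, q2)
    print("weights:", weights)
    print("expected Q:", value)
```

In this sketch the optimistic estimate decides which sampled actions receive weight, while the pessimistic estimate supplies the value that is actually averaged. A lower `temperature` concentrates the weight on the action with the highest greedy Q.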

Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation

Non-local Policy Optimization via Diversity-regularized Collaborative Exploration

Hierarchical Diffusion Policy: manipulation trajectory generation via contact guidance

Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation

Careful at Estimation and Bold at Exploration

Sampling-based Exploration for Reinforcement Learning of Dexterous Manipulation

Dexterous Functional Pre-Grasp Manipulation with Diffusion Policy

AffordDP: Generalizable Diffusion Policy with Transferable Affordance

Object Manipulation with an Anthropomorphic Robotic Hand via Deep Reinforcement Learning with a Synergy Space of Natural Hand Poses

Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Diffusion Co-Policy for Synergistic Human-Robot Collaborative Tasks

OMPO: A Unified Framework for RL under Policy and Dynamics Shifts

Nonprehensile Planar Manipulation through Reinforcement Learning with Multimodal Categorical Exploration

Cross-Embodiment Dexterous Grasping with Reinforcement Learning

Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation

Diff-DAgger: Uncertainty Estimation with Diffusion Policy for Robotic Manipulation

Planning-Guided Diffusion Policy Learning for Generalizable Contact-Rich Bimanual Manipulation