Abstract: Deep reinforcement learning (DRL) faces significant challenges in hard-exploration tasks with sparse or deceptive rewards and large state spaces. These challenges severely limit the practical application of DRL. Most previous exploration methods rely on complex architectures to estimate state novelty or introduce sensitive hyperparameters, resulting in instability. To mitigate these issues, we propose an efficient adaptive trajectory-constrained exploration strategy for DRL. The proposed method guides the agent's policy away from suboptimal solutions by treating previous offline demonstrations as references. Specifically, the approach gradually expands the agent's exploration scope and pursues optimality through constrained optimization. Additionally, we introduce a novel policy-gradient-based optimization algorithm that uses adaptive clipped trajectory-distance rewards for both single- and multi-agent reinforcement learning. We provide a theoretical analysis of our method, including a derivation of worst-case approximation error bounds, which supports the validity of our approach for enhancing exploration. To evaluate the effectiveness of the proposed method, we conducted experiments on two large 2D grid world mazes and several MuJoCo tasks. The quantitative results demonstrate significant advantages of our method in achieving temporally extended exploration and avoiding myopic and suboptimal behaviors in both single- and multi-agent settings. The code used in the study is available at https://github.com/buaawgj/TACE.
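
As a rough illustration of the idea of a clipped trajectory-distance reward, the following minimal sketch computes a bonus that grows as the current trajectory moves away from stored demonstrations, caps that bonus at a threshold, and lets the threshold widen over time. It is not the implementation from the paper or its repository: the distance measure (mean nearest-state Euclidean distance), the function names `trajectory_distance`, `clipped_distance_bonus`, and `expand_delta`, and the parameters `delta`, `growth`, and `delta_max` are all assumptions chosen for illustration.

```python
import numpy as np


def trajectory_distance(traj, demo):
    """Mean Euclidean distance from each state in `traj` to its nearest state in `demo`.

    Assumed distance measure for illustration; both inputs have shape (T, state_dim).
    """
    diffs = traj[:, None, :] - demo[None, :, :]   # shape (T_traj, T_demo, state_dim)
    pairwise = np.linalg.norm(diffs, axis=-1)     # shape (T_traj, T_demo)
    return float(pairwise.min(axis=1).mean())


def clipped_distance_bonus(traj, demos, delta):
    """Distance to the closest demonstration, clipped from above at `delta`.

    Clipping means the agent gains nothing extra once it is already
    `delta` away from every reference trajectory.
    """
    nearest = min(trajectory_distance(traj, demo) for demo in demos)
    return min(nearest, delta)


def expand_delta(delta, growth=1.1, delta_max=10.0):
    """Hypothetical schedule that gradually widens the clipping threshold."""
    return min(delta * growth, delta_max)


if __name__ == "__main__":
    demo = np.array([[0.0, 0.0], [1.0, 0.0]])
    traj = np.array([[0.0, 3.0], [1.0, 4.0]])
    delta = 2.0
    for step in range(3):
        bonus = clipped_distance_bonus(traj, [demo], delta)
        print(f"step={step} delta={delta:.3f} bonus={bonus:.3f}")
        delta = expand_delta(delta)
```

In a policy-gradient setting, a bonus of this kind would typically be added to the environment return of each sampled trajectory before the gradient estimate is formed; how the paper combines the two terms is not specified in the abstract.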

Adaptive trajectory-constrained exploration strategy for deep reinforcement learning
