Abstract: The sparsity of reward feedback remains a challenging problem in online deep reinforcement learning (DRL). Previous approaches have used offline demonstrations to achieve strong results on several hard tasks. However, these approaches place high demands on demonstration quality, and obtaining expert-like actions is often costly or impractical. To tackle these problems, we propose a simple and efficient algorithm called Policy Optimization with Smooth Guidance (POSG), which leverages a small set of state-only demonstrations (demonstrations that contain no expert action information) to perform approximate, feasible long-term credit assignment indirectly and to facilitate exploration. Specifically, we first design a trajectory-importance evaluation mechanism that scores the quality of the current trajectory against the demonstrations. Then, we introduce a guidance reward computation technique based on trajectory importance to measure the impact of each state-action pair, fusing the demonstrator's state distribution with reward information into the guidance reward. We theoretically analyze the performance improvement caused by smooth guidance rewards and derive a new worst-case lower bound on that improvement. Extensive experiments demonstrate POSG's significant advantages in control performance and convergence speed in four sparse-reward environments: the grid-world maze, Hopper-v4, HalfCheetah-v4, and Ant maze. We report specific quantitative metrics to substantiate these advantages.
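The two steps named in the abstract (trajectory-importance evaluation, then a guidance reward per step) can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: the RBF kernel, the use of a squared maximum mean discrepancy (MMD) as the trajectory-to-demonstration distance, the `exp(-MMD^2)` importance score, and the hypothetical function names `trajectory_importance` and `guidance_rewards`. It should not be read as the method defined in the paper.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Pairwise RBF kernel between state arrays a (n, d) and b (m, d)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def trajectory_importance(traj_states, demo_states, bandwidth=1.0):
    """Assumed importance score in (0, 1]: exp(-MMD^2) between the states
    visited by the current trajectory and the demonstration states.
    Only states are compared, so no expert actions are needed."""
    k_tt = rbf_kernel(traj_states, traj_states, bandwidth).mean()
    k_dd = rbf_kernel(demo_states, demo_states, bandwidth).mean()
    k_td = rbf_kernel(traj_states, demo_states, bandwidth).mean()
    mmd2 = max(k_tt + k_dd - 2.0 * k_td, 0.0)  # biased estimate, clipped at 0
    return float(np.exp(-mmd2))


def guidance_rewards(traj_states, demo_states, episode_return, bandwidth=1.0):
    """Assumed credit assignment: spread (importance * episode return) over
    time steps, weighting each step by its similarity to the demonstrations."""
    importance = trajectory_importance(traj_states, demo_states, bandwidth)
    sim = rbf_kernel(traj_states, demo_states, bandwidth).mean(axis=1)
    total = sim.sum()
    if total > 0.0:
        weights = sim / total
    else:
        # No step resembles the demonstrations: fall back to uniform credit.
        weights = np.full(len(traj_states), 1.0 / len(traj_states))
    return importance * episode_return * weights


if __name__ == "__main__":
    demo = np.linspace(0.0, 1.0, 5).reshape(-1, 1)  # toy 1-D state demos
    traj = demo + 0.1                               # a nearby trajectory
    print(trajectory_importance(traj, demo))
    print(guidance_rewards(traj, demo, episode_return=1.0))
```

Under these assumptions, a sparse episodic return is turned into a dense per-step signal whose total equals the return scaled by the importance score, so trajectories that stay close to the demonstrated state distribution receive more credit.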

Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations
