Abstract: Proximal policy optimization (PPO) is a deep reinforcement learning algorithm based on the actor–critic (AC) architecture. In the classic AC architecture, the critic (value) network estimates the value function, while the actor (policy) network optimizes the policy according to the estimated value function. The efficiency of the classic AC architecture is limited because the policy does not directly participate in the value function update. As a result, the value function estimate can be inaccurate, which degrades the performance of the PPO algorithm. To address this, we design a novel AC architecture with policy feedback (AC-PF) that introduces the policy into the update process of the value function, and we further propose PPO with policy feedback (PPO-PF). For the AC-PF architecture, the policy-based expected (PBE) value function and discounted reward formulas are designed by drawing inspiration from expected Sarsa. To enhance the sensitivity of the value function to changes in the policy and to improve the accuracy of PBE value estimation at the early learning stage, we propose a policy update method based on a clipped discount factor. Moreover, we specifically define the loss functions of the policy network and value network to ensure that the policy update of PPO-PF satisfies the unbiased estimation of the trust region. Experiments on Atari games and control tasks show that, compared to PPO, PPO-PF converges faster, achieves higher reward, and exhibits smaller reward variance.
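
For readers unfamiliar with expected Sarsa, which the abstract names as the inspiration for the PBE value function, the following is a minimal sketch of the standard one-step expected Sarsa target. It is not the paper's PBE formula, which the abstract does not state. The function name `expected_sarsa_target`, its parameters, and the numeric values are hypothetical and chosen only for illustration. The point is that the bootstrap term is weighted by the policy's action probabilities, so the value target changes whenever the policy changes.

```python
from typing import Sequence


def expected_sarsa_target(
    reward: float,
    next_q_values: Sequence[float],
    next_action_probs: Sequence[float],
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """One-step expected Sarsa target: r + gamma * sum_a pi(a|s') * Q(s', a).

    Illustrative only; names and defaults are assumptions, not the paper's.
    """
    if len(next_q_values) != len(next_action_probs):
        raise ValueError("need one probability per action value")
    if done:
        # No bootstrapping from a terminal state.
        return reward
    # Expectation of the next-state action values under the current policy.
    expected_next_value = sum(
        p * q for p, q in zip(next_action_probs, next_q_values)
    )
    return reward + gamma * expected_next_value


# Same transition, two different policies: the target moves with the policy.
q_next = [1.0, 3.0]
uniform = expected_sarsa_target(1.0, q_next, [0.5, 0.5], gamma=0.9)
greedy_leaning = expected_sarsa_target(1.0, q_next, [0.1, 0.9], gamma=0.9)
```

In this toy case the policy that puts more probability on the higher-valued action yields the larger target, which is the kind of policy-to-value coupling the abstract describes as missing from the classic AC update.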

Proximal Policy Optimization Based on Self-directed Action Selection

Behavior Proximal Policy Optimization

Proximal Policy Optimization Algorithms

Fast Proximal Policy Optimization

Proximal Policy Optimization with Policy Feedback

Proximal policy optimization via enhanced exploration efficiency

An Improved Proximal Policy Optimization Algorithm for Autonomous Driving Decision-Making

An Off-Policy Trust Region Policy Optimization Method with Monotonic Improvement Guarantee for Deep Reinforcement Learning

Demonstration-Based Proximal Policy Optimization with Action Guidance

Proximal Policy Optimization with Future Rewards

Truly Proximal Policy Optimization

Proximal Policy Optimization Smoothed Algorithm

Fast-PPO: Proximal Policy Optimization with Optimal Baseline Method

AMBER: Adaptive Multi-Batch Experience Replay for Continuous Action Control

Policy Optimization with Model-based Explorations

Decentralized Policy Optimization

Anti-Martingale Proximal Policy Optimization

Model-Based Reinforcement Learning via Proximal Policy Optimization

Proximal policy optimization with model-based methods

Proximal Policy Optimization with Mixed Distributed Training

Reflective Policy Optimization