Simple Policy Optimization

Zhengpeng Xie
2024-04-28
Abstract: The PPO (Proximal Policy Optimization) algorithm has demonstrated excellent performance in many fields and is considered a simplified version of the TRPO (Trust Region Policy Optimization) algorithm. However, the ratio clipping operation in PPO may not always effectively enforce the trust region constraints, which can be a potential factor affecting the stability of the algorithm. In this paper, we propose the Simple Policy Optimization (SPO) algorithm, which introduces a novel clipping method for the KL divergence between the old and current policies. Extensive experimental results in Atari 2600 environments indicate that, compared to the mainstream variants of PPO, SPO achieves better sample efficiency, extremely low KL divergence, and higher policy entropy, and is robust to increases in network depth or complexity. More importantly, SPO maintains the simplicity of an unconstrained first-order algorithm. Our code is available at
Machine Learning
What problem does this paper attempt to address?
The paper addresses a weakness of the Proximal Policy Optimization (PPO) algorithm in reinforcement learning: although PPO performs well in many domains, its ratio clipping operation may not effectively enforce the trust region constraint, which can affect the stability of the algorithm. To solve this problem, the authors propose a new algorithm called Simple Policy Optimization (SPO). SPO introduces a new clipping method for the KL divergence between the old and current policies, which limits how far the new policy can move from the old one. With this method, the authors report that SPO achieves better sample efficiency, extremely low KL divergence, and higher policy entropy in Atari 2600 environments, and that it is robust to increases in network depth or complexity. SPO also keeps the simplicity of an unconstrained first-order algorithm. On these measures, the reported results favor SPO over mainstream PPO variants.
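
To make the motivation concrete, the following is a minimal sketch (not the authors' code) of the standard PPO clipped surrogate for a single sample, together with the KL divergence between two discrete action distributions. The function names, the probabilities, and the value eps=0.2 are illustrative assumptions. It shows a case where the probability ratio of the sampled action stays inside the clip range while the KL divergence between the old and new policies is large. It does not reproduce the SPO objective, which is not given in this summary.

```python
import math


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def categorical_kl(p_old, p_new):
    """KL(p_old || p_new) for two discrete action distributions."""
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0.0)


# Hypothetical action probabilities in one state (illustrative values only).
p_old = [0.5, 0.3, 0.2]
p_new = [0.55, 0.01, 0.44]

sampled_action = 0
ratio = p_new[sampled_action] / p_old[sampled_action]  # about 1.1, inside [0.8, 1.2]
kl = categorical_kl(p_old, p_new)

print(f"ratio for sampled action: {ratio:.3f}")
print(f"KL(old || new): {kl:.3f}")
```

Because the clip only looks at the ratio of the action that was actually sampled, the probabilities of the other actions can shift substantially without the surrogate registering it. A method that acts on the KL divergence directly, as the summary says SPO does, constrains the whole distribution instead.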