Abstract: Proximal policy optimization (PPO) has yielded state-of-the-art results in policy search, a subfield of reinforcement learning, with one of its key points being the use of a surrogate objective function to restrict the step size at each policy update. Although such restriction is helpful, the algorithm still suffers from performance instability and optimization inefficiency from the sudden flattening of the curve. To address this issue we present a PPO variant, named Proximal Policy Optimization Smooth Algorithm (PPOS), and its critical improvement is the use of a functional clipping method instead of a flat clipping method. We compare our method with PPO and PPORB, which adopts a rollback clipping method, and prove that our method can conduct more accurate updates at each time step than other PPO methods. Moreover, we show that it outperforms the latest PPO variants on both performance and stability in challenging continuous control tasks.
What problem does this paper attempt to address?
The paper addresses the performance instability and low optimization efficiency that proximal policy optimization (PPO), a reinforcement learning algorithm, shows when updating policies. PPO improves performance by restricting the step size of each policy update, but its flat clipping makes the clipped curve flatten abruptly outside the clipping range, which hurts the stability and efficiency of the algorithm. To solve this problem, the paper proposes a PPO variant, the Proximal Policy Optimization Smooth Algorithm (PPOS). Its key improvement is to replace the traditional flat clipping method with a functional clipping method, which lets the algorithm make more accurate updates at each step and yields higher performance and stability in continuous control tasks.
### Background and Problems of the Paper
- **Reinforcement Learning**: Model-free reinforcement learning based on deep models has made significant progress in recent years, with applications ranging from video games and board games to robotics and complex control tasks.
- **Policy Gradient Methods**: Widely used in model-free policy search, they update the policy step by step by estimating the gradient of the expected return, with the aim of converging to the optimal policy.
- **PPO**: Adopts a probability ratio clipping mechanism to limit the step size of policy updates. Although this simplifies the optimization process, in some cases it cannot truly keep the probability ratio within the clipping range, resulting in performance instability.
- **PPORB**: Proposes a rollback operation that aims to prevent the policy from being pushed too far during training. However, this method may cause the policy search to oscillate near the optimal policy in high-dimensional tasks, and it introduces an additional hyper-parameter that requires empirical tuning.
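The two baseline clipping schemes described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `flat_clip` is the standard PPO clip, while the linear form used in `rollback_clip` and all function names and default values (`eps=0.2`, `alpha=0.3`) are assumptions made for this example.

```python
def flat_clip(ratio, eps=0.2):
    """PPO-style flat clipping: the ratio is held constant outside [1-eps, 1+eps]."""
    return max(1.0 - eps, min(ratio, 1.0 + eps))


def rollback_clip(ratio, eps=0.2, alpha=0.3):
    """Rollback-style clipping (assumed linear form, for illustration only).

    Outside the clipping range the curve has slope -alpha, so pushing the
    ratio further out lowers the clipped value instead of leaving it flat.
    """
    if ratio > 1.0 + eps:
        return -alpha * ratio + (1.0 + alpha) * (1.0 + eps)
    if ratio < 1.0 - eps:
        return -alpha * ratio + (1.0 + alpha) * (1.0 - eps)
    return ratio


def surrogate(ratio, advantage, clip_fn, **kwargs):
    """Pessimistic per-sample surrogate: min of unclipped and clipped terms."""
    return min(ratio * advantage, clip_fn(ratio, **kwargs) * advantage)


if __name__ == "__main__":
    for r in (0.5, 0.9, 1.0, 1.5, 2.0):
        print(r, flat_clip(r), rollback_clip(r))
```

With a positive advantage and a ratio above `1 + eps`, the flat clip gives a constant objective (zero gradient with respect to the ratio), whereas the rollback form gives a negative slope that pulls the ratio back toward the clipping range; `alpha` is the extra hyper-parameter mentioned in the PPORB bullet.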
### Core Contributions of PPOS
- **Functional Clipping Method**: PPOS introduces a new clipping function \(F_{\text{PPOS}}(r_{s,a}(\pi), \epsilon, \alpha)\), which uses the hyperbolic tangent function (tanh) to smoothly limit policy updates and avoid the abrupt changes brought by the traditional flat clipping method.
- **Performance and Stability**: Experimental results show that PPOS not only improves the learning speed and final reward in multiple continuous control tasks but also maintains higher stability.
- **Hyper-parameter Selection**: The paper provides an exponential regression function based on the observation dimension to guide the selection of the hyper-parameter \(\alpha\), making it easier for users to tune it for different tasks.
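To make the idea of tanh-based functional clipping concrete, here is a minimal sketch. The summary above does not give the exact expression for \(F_{\text{PPOS}}\), so the formula in `smooth_clip` is an assumed form chosen to be continuous at the clipping boundaries and bounded outside them; the function name and the default values of `eps` and `alpha` are likewise hypothetical.

```python
import math


def smooth_clip(ratio, eps=0.2, alpha=0.3):
    """Assumed tanh-based functional clipping (illustrative, not the paper's code).

    Inside [1-eps, 1+eps] the ratio passes through unchanged. Outside, the
    value bends back toward the range along a tanh curve, so the slope is
    negative near the boundary and fades smoothly to zero far from it.
    """
    if ratio > 1.0 + eps:
        return -alpha * math.tanh(ratio - 1.0) + 1.0 + eps + alpha * math.tanh(eps)
    if ratio < 1.0 - eps:
        return -alpha * math.tanh(ratio - 1.0) + 1.0 - eps - alpha * math.tanh(eps)
    return ratio


if __name__ == "__main__":
    for r in (0.2, 0.8, 1.0, 1.2, 1.5, 3.0, 10.0):
        print(f"r={r:5.2f}  clipped={smooth_clip(r):.4f}")
```

Under this assumed form, the clipped value for very large ratios approaches `1 + eps - alpha * (1 - tanh(eps))` rather than decreasing without bound, which is one way a smooth function can avoid both the zero gradient of a flat clip and the unbounded pull of a linear rollback.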
### Experimental Results
- **High-dimensional Tasks**: In high-dimensional tasks (such as Humanoid-v2 and Ant-v2), PPOS significantly outperforms PPO and PPORB in terms of learning speed and final reward.
- **Medium- and Low-dimensional Tasks**: In medium- and low-dimensional tasks (such as HalfCheetah-v2, Swimmer-v2 and Reacher-v2), PPOS also shows better performance and stability.
- **Hyper-parameter Sensitivity**: Across tasks of different dimensions, the choice of the PPOS hyper-parameter \(\alpha\) has a significant impact on performance, and the paper backs its recommendations with experimental data.
### Conclusion
By introducing the functional clipping method, PPOS restricts policy updates more effectively while maintaining the stability and efficiency of the algorithm. The paper also provides a practical hyper-parameter selection guide, which should help when deploying the algorithm in new environments. Future research directions include exploring different clipping mechanisms, studying the relationship between the clipping range and the clipping function, and further optimizing the trust-region method.