Fast Proximal Policy Optimization

Weiqi Zhao, Haobo Jiang, Jin Xie
DOI: https://doi.org/10.1007/978-3-031-02444-3_6
2022-01-01
Abstract: Proximal policy optimization (PPO) is one of the most promising deep reinforcement learning methods and has achieved remarkable success in a variety of challenging control tasks. However, the overall update gradient computed over a batch of samples may mislead the optimization of some sub-samples, which can reduce sample efficiency and degrade final decision performance. Although the minimum operation in PPO can relieve this problem, its slow escape speed makes it difficult for the misled samples to leave the wrong optimization range within the limited number of minibatch update epochs. In this paper, we propose a novel fast version of PPO, named fast-PPO, that replaces the original minimum operation with two accelerating operations called linear-pulling and quadratic-pulling. Both increase the update weight of the gradient for the misled samples so that the gradient of the overall objective follows their expected optimization direction. Extensive experiments on classic discrete control tasks and MuJoCo-based continuous control tasks verify the effectiveness of the proposed fast-PPO.
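For orientation, the minimum operation mentioned in the abstract can be illustrated with the standard PPO clipped surrogate, min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the probability ratio and A the advantage. The sketch below is a minimal NumPy illustration of that standard objective only; it is not the paper's fast-PPO, and the linear-pulling and quadratic-pulling operations are not reproduced because the abstract does not define them. The function names and the value ε = 0.2 are assumptions chosen for the example.

```python
import numpy as np

def ppo_clip_surrogate(ratio, advantage, eps=0.2):
    """Per-sample standard PPO surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

def surrogate_grad_wrt_ratio(ratio, advantage, eps=0.2):
    """Per-sample (sub)gradient of the surrogate with respect to the ratio.

    The gradient equals A where the min selects the unclipped term
    (including inside the clip range) and 0 where the clipped term is selected.
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.where(unclipped <= clipped, advantage, 0.0)

# Three samples with positive advantage (hypothetical values):
# r = 0.5 has moved the wrong way, r = 1.0 is unchanged, r = 1.5 has moved far enough.
ratios = np.array([0.5, 1.0, 1.5])
advantages = np.array([1.0, 1.0, 1.0])

values = ppo_clip_surrogate(ratios, advantages)
grads = surrogate_grad_wrt_ratio(ratios, advantages)
```

In this sketch the sample with r = 0.5 and A > 0 still receives a nonzero gradient, which is how the minimum operation can pull it back. That gradient weight is simply A, however far the ratio has drifted below 1−ε, which is one way to read the "slow escape speed" the abstract describes.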