Abstract: Proximal Policy Optimization (PPO) has been broadly applied to various domains, including Large Language Model (LLM) optimization and robotics learning. However, PPO is limited by a fixed setting for the clipping bound. Specifically, there is no theoretical proof that the optimal clipping bound remains consistent throughout the entire training process. Truncating the ratio of the new and old policies with a unique clipping bound ensures stable training and can achieve the best training performance. Additionally, previous research suggests that a fixed clipping bound limits the agent's exploration. Therefore, researching a dynamic clipping bound to enhance PPO's performance can be highly beneficial. Unlike previous clipping approaches, we treat maximizing the cumulative return as the preference of the reinforcement learning (RL) task, and propose a bi-level proximal policy optimization paradigm, which not only optimizes the policy but also dynamically adjusts the clipping bound to reflect that preference, further improving the training outcomes and stability of PPO. Based on this paradigm, we introduce a new algorithm named Preference-based Proximal Policy Optimization (Pb-PPO). The algorithm uses a multi-armed bandit to reflect RL preferences (we also validate that this approach can be used to reflect human preferences) and to recommend the optimal clipping bound for PPO in each epoch, thereby achieving more stable and better training outcomes.
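
For reference, the clipped surrogate objective that the abstract refers to is usually written as below. The notation is the standard PPO formulation rather than anything taken from this abstract: r_t(θ) is the probability ratio of the new and old policies, Â_t is an advantage estimate, and ε is the clipping bound. In the setting described above, ε would no longer be a constant but a value chosen anew in each epoch.

```latex
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[
      \min\!\left(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t
      \right)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```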

What problem does this paper attempt to address?

The paper addresses the limitation of the fixed clipping bound in the Proximal Policy Optimization (PPO) algorithm. Standard PPO uses a fixed clipping bound to limit the deviation between the new and old policies and thereby keep training stable. However, there is no theoretical proof that a single bound remains optimal throughout training, and prior work suggests that a fixed bound restricts the agent's exploration. It is therefore worthwhile to study methods that adjust the clipping bound dynamically to improve PPO's performance. To this end, the paper proposes a preference-based bi-level proximal policy optimization framework that not only optimizes the policy but also dynamically adjusts the clipping bound to reflect the preference of the reinforcement learning task, thereby improving training outcomes and stability. Based on this framework, the authors introduce a new algorithm, Preference-based Proximal Policy Optimization (Pb-PPO). Pb-PPO uses a multi-armed bandit algorithm to reflect task preferences (the authors also verify that the method can be used to reflect human preferences) and recommends the optimal clipping bound in each training epoch. In benchmark tests across multiple environments, such as Gym-Mujoco and PyBullet-Gym, Pb-PPO exhibits more stable training curves, higher sample efficiency, and better overall training performance than PPO with a fixed clipping bound and various other clipping methods.
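
The bi-level idea can be illustrated with a minimal sketch: an outer bandit picks a clipping bound for each epoch, and the return observed after that epoch is fed back as the bandit's reward. The summary does not state which bandit variant, candidate bounds, or reward signal Pb-PPO uses, so everything below is an assumption: UCB1 is swapped in as one common bandit rule, the candidate set `[0.1, 0.2, 0.3]` is arbitrary, and `UCBClipSelector` and `toy_epoch_return` are hypothetical names, the latter standing in for a real "train one PPO epoch, then evaluate" step.

```python
import math
import random


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped surrogate for a single sample."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


class UCBClipSelector:
    """UCB1 bandit whose arms are candidate clipping bounds (hypothetical helper)."""

    def __init__(self, candidate_bounds, exploration=2.0):
        self.bounds = list(candidate_bounds)
        self.exploration = exploration
        self.counts = [0] * len(self.bounds)
        self.mean_return = [0.0] * len(self.bounds)

    def select(self):
        # Try every arm once before applying the UCB rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.exploration * math.log(total) / count)
            for mean, count in zip(self.mean_return, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, observed_return):
        # Incremental running mean of the return observed under this bound.
        self.counts[arm] += 1
        self.mean_return[arm] += (
            observed_return - self.mean_return[arm]
        ) / self.counts[arm]


def toy_epoch_return(eps, rng):
    """Assumed stand-in for one PPO epoch plus evaluation; not a real environment.

    The toy return peaks at eps = 0.2 and carries a little Gaussian noise.
    """
    return 1.0 - 5.0 * abs(eps - 0.2) + rng.gauss(0.0, 0.05)


rng = random.Random(0)
selector = UCBClipSelector([0.1, 0.2, 0.3])
for epoch in range(200):
    arm = selector.select()                 # outer level: choose the bound
    eps = selector.bounds[arm]
    epoch_return = toy_epoch_return(eps, rng)  # inner level: policy update goes here
    selector.update(arm, epoch_return)      # feed the task return back to the bandit

best_arm = max(range(len(selector.bounds)), key=selector.mean_return.__getitem__)
best = selector.bounds[best_arm]
print("bound with highest estimated return:", best)
```

In a real training loop, the call to `toy_epoch_return` would be replaced by a PPO update that passes `eps` into `clipped_surrogate` (or its batched equivalent), followed by an evaluation rollout whose return becomes the bandit reward.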

A dynamical clipping approach with task feedback for Proximal Policy Optimization

Behavior Proximal Policy Optimization

Beyond Reward: Offline Preference-guided Policy Optimization

Authentic Boundary Proximal Policy Optimization

Truly Proximal Policy Optimization

Proximal Policy Optimization Smoothed Algorithm

Proximal policy optimization via enhanced exploration efficiency

Beyond the Boundaries of Proximal Policy Optimization

Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment

Proximal Policy Optimization with Mixed Distributed Training

A Theoretical Analysis of Optimistic Proximal Policy Optimization in Linear Markov Decision Processes

CIM-PPO: Proximal Policy Optimization with Liu-Correntropy Induced Metric

Trust Region-Guided Proximal Policy Optimization

Fast Proximal Policy Optimization

Proximal Policy Optimization with Relative Pearson Divergence

Transductive Off-policy Proximal Policy Optimization

Accelerating Proximal Policy Optimization Learning Using Task Prediction for Solving Environments with Delayed Rewards

Proximal Policy Optimization Algorithms

Proximal Policy Optimization with Adaptive Exploration

PTR-PPO: Proximal Policy Optimization with Prioritized Trajectory Replay

PPO-Clip Attains Global Optimality: Towards Deeper Understandings of Clipping