Abstract: Proximal Policy Optimization (PPO) has been broadly applied to various domains, including Large Language Model (LLM) optimization and robot learning. However, PPO is limited by its fixed clipping bound. Specifically, there is no theoretical proof that the optimal clipping bound remains the same throughout the entire training process, nor that truncating the ratio of the new and old policies with a single clipping bound both ensures stable training and achieves the best training performance. Additionally, previous research suggests that a fixed clipping bound limits the agent's exploration. Therefore, a dynamic clipping bound is a promising way to enhance PPO's performance. Unlike previous clipping approaches, we treat maximizing the cumulative return as the preference of a reinforcement learning (RL) task, and propose a bi-level proximal policy optimization paradigm that not only optimizes the policy but also dynamically adjusts the clipping bound to reflect that preference, further improving the training outcomes and stability of PPO. Based on this bi-level paradigm, we introduce a new algorithm named Preference based Proximal Policy Optimization (Pb-PPO). The algorithm uses a multi-armed bandit to reflect RL preferences (we also validate that this approach can reflect human preferences) and to recommend a clipping bound for PPO in each epoch, thereby achieving more stable and better training outcomes.
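The bi-level idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which bandit algorithm is used, which candidate bounds are considered, or how the return is measured, so everything below is an assumption made for illustration: UCB1 is chosen as the bandit, the candidate bounds are arbitrary, and `toy_epoch_return` is a hypothetical stand-in for training one PPO epoch with a given bound and evaluating the resulting policy. Only `clipped_surrogate` follows the standard PPO clipped objective.

```python
import math
import random


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped objective for one sample:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


class ClipBoundBandit:
    """UCB1 bandit whose arms are candidate clipping bounds.

    Assumption: the abstract only says "multi-armed bandit"; UCB1 is
    one possible choice, not necessarily the one used by Pb-PPO.
    """

    def __init__(self, candidate_bounds, exploration=2.0):
        self.bounds = list(candidate_bounds)
        self.exploration = exploration
        self.counts = [0] * len(self.bounds)
        self.mean_return = [0.0] * len(self.bounds)

    def select(self):
        # Pull every arm once before relying on the UCB score.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.exploration * math.log(total) / n)
            for mean, n in zip(self.mean_return, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, observed_return):
        # Incremental running mean of the return observed for this arm.
        self.counts[arm] += 1
        n = self.counts[arm]
        self.mean_return[arm] += (observed_return - self.mean_return[arm]) / n


def toy_epoch_return(eps, rng):
    """Hypothetical placeholder for 'run one PPO epoch with bound eps,
    then measure the return'. Not taken from the paper."""
    return 1.0 - abs(eps - 0.2) + rng.gauss(0.0, 0.05)


def run(num_epochs=200, seed=0):
    rng = random.Random(seed)
    bandit = ClipBoundBandit([0.05, 0.1, 0.2, 0.3, 0.5])
    for _ in range(num_epochs):
        arm = bandit.select()        # upper level: pick a clipping bound
        eps = bandit.bounds[arm]
        # Lower level (omitted): update the policy by maximizing
        # clipped_surrogate(ratio, advantage, eps) over sampled data.
        bandit.update(arm, toy_epoch_return(eps, rng))
    return bandit
```

In this sketch the upper level treats the return observed after each epoch as the feedback signal for the chosen bound, so bounds that lead to higher returns are recommended more often while the exploration term keeps the other candidates from being abandoned too early.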

Transductive Off-policy Proximal Policy Optimization

Behavior Proximal Policy Optimization

An Off-Policy Trust Region Policy Optimization Method with Monotonic Improvement Guarantee for Deep Reinforcement Learning

Beyond Reward: Offline Preference-guided Policy Optimization

Truly Proximal Policy Optimization

Beyond the Boundaries of Proximal Policy Optimization

Proximal Policy Optimization Algorithms

Trust Region-Guided Proximal Policy Optimization

Model-Based Reinforcement Learning via Proximal Policy Optimization

Authentic Boundary Proximal Policy Optimization

Fast Proximal Policy Optimization

Reflective Policy Optimization

Proximal policy optimization via enhanced exploration efficiency

Proximal Policy Optimization with Mixed Distributed Training

Fast-PPO: Proximal Policy Optimization with Optimal Baseline Method

Policy Optimization with Model-based Explorations

A dynamical clipping approach with task feedback for Proximal Policy Optimization

Proximal Policy Optimization Smoothed Algorithm

The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games

DPO Meets PPO: Reinforced Token Optimization for RLHF