A Theoretical Analysis of Optimistic Proximal Policy Optimization in Linear Markov Decision Processes

Han Zhong, Tong Zhang
2023-06-08
Abstract: The proximal policy optimization (PPO) algorithm stands as one of the most prosperous methods in the field of reinforcement learning (RL). Despite its success, the theoretical understanding of PPO remains deficient. Specifically, it is unclear whether PPO or its optimistic variants can effectively solve linear Markov decision processes (MDPs), which are arguably the simplest models in RL with function approximation. To bridge this gap, we propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback, and establish a $\tilde{\mathcal{O}}(d^{3/4}H^2K^{3/4})$ regret for it. Here $d$ is the ambient dimension of linear MDPs, $H$ is the length of each episode, and $K$ is the number of episodes. Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both stochastic linear MDPs and adversarial linear MDPs with full information. Additionally, our algorithm design features a novel multi-batched updating mechanism and the theoretical analysis utilizes a new covering number argument of value and policy classes, which might be of independent interest.
Machine Learning, Artificial Intelligence, Optimization and Control
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses a theoretical question in reinforcement learning (RL): **whether the proximal policy optimization (PPO) algorithm or its optimistic variants can effectively solve linear Markov decision processes (MDPs)**. Although PPO has been highly successful in practice, its theoretical understanding remains limited, and it is unclear whether it can handle even linear MDPs, which are arguably the simplest models in RL with function approximation.

To fill this gap, the authors propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback and establish a regret bound of $\tilde{\mathcal{O}}(d^{3/4}H^2K^{3/4})$, where $d$ is the ambient dimension of the linear MDP, $H$ is the length of each episode, and $K$ is the number of episodes. Compared with existing policy-based algorithms, this is the state-of-the-art regret bound in both stochastic linear MDPs and adversarial linear MDPs with full-information feedback. The algorithm design also introduces a multi-batched updating mechanism, and the theoretical analysis relies on a new covering number argument for the value and policy classes, which may be of independent interest.
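
To make the scaling of the stated bound concrete, the following is a minimal sketch that evaluates only the leading term $d^{3/4}H^2K^{3/4}$. It assumes the hidden constant is 1 and drops the logarithmic factors absorbed by the tilde, so the numbers show growth rates, not actual regret values. The function names `regret_bound` and `average_regret` are hypothetical and do not come from the paper.

```python
def regret_bound(d: int, H: int, K: int) -> float:
    """Leading-order term d^(3/4) * H^2 * K^(3/4).

    Assumption: constant factor 1, logarithmic factors ignored.
    """
    return (d ** 0.75) * (H ** 2) * (K ** 0.75)


def average_regret(d: int, H: int, K: int) -> float:
    """Bound divided by the number of episodes; it scales as K^(-1/4)."""
    return regret_bound(d, H, K) / K


if __name__ == "__main__":
    d, H = 16, 10  # illustrative values only
    for K in (10**4, 16 * 10**4, 256 * 10**4):
        print(f"K={K:>8}  bound={regret_bound(d, H, K):.3e}  "
              f"per-episode={average_regret(d, H, K):.3f}")
```

Because the bound grows as $K^{3/4}$, it is sublinear in $K$: under these assumptions, the per-episode value halves each time $K$ is multiplied by 16.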