Abstract: We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy evaluation and optimization. Existing on-policy evaluation methods suffer from sub-optimal convergence rates and fail to handle insufficiently random policies, e.g., deterministic policies, because of a lack of exploration. To remedy these issues, we develop a novel variance-reduced temporal difference (VRTD) method with linear function approximation for randomized policies, along with sharp convergence guarantees, and an exploratory variance-reduced temporal difference (EVRTD) method for insufficiently random policies with comparable convergence guarantees. We further establish a linear convergence rate on the bias of policy evaluation, which is essential for improving the overall sample complexity of policy optimization. On the other hand, in contrast to the intensive research interest in finite-sample analysis of policy gradient methods for discounted MDPs, existing studies on policy gradient methods for AMDPs mostly focus on regret bounds under restrictive assumptions on the underlying Markov processes (see, e.g., Abbasi-Yadkori et al., 2019), and they often lack guarantees on the overall sample complexity. To this end, we develop an average-reward variant of the stochastic policy mirror descent (SPMD) method (Lan, 2022). We establish the first $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for solving AMDPs with a policy gradient method under both the generative model (with a unichain assumption) and the Markovian noise model (with an ergodic assumption). This bound can be further improved to $\widetilde{\mathcal{O}}(\epsilon^{-1})$ for solving regularized AMDPs. Our theoretical advantages are corroborated by numerical experiments.
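To make the policy mirror descent idea concrete, the following is a minimal sketch of a single KL-proximal policy update at one state. It is not the SPMD algorithm of the paper: it assumes a cost-minimization convention, a tabular policy, and that estimates of the differential action values are already available (in the paper's setting these would come from a policy evaluation step). The function name `pmd_kl_step`, the step size `eta`, and the toy numbers are hypothetical choices for illustration.

```python
import math


def pmd_kl_step(pi_s, q_s, eta):
    """One KL-based mirror descent update for a single state.

    pi_s: current action probabilities at the state (assumed to sum to 1)
    q_s:  estimated differential action values, treated as costs (assumption)
    eta:  step size (hypothetical choice)

    Returns the distribution proportional to pi_s[a] * exp(-eta * q_s[a]).
    """
    # Shift by the smallest cost so every exponent is <= 0 (avoids overflow).
    q_min = min(q_s)
    weights = [p * math.exp(-eta * (q - q_min)) for p, q in zip(pi_s, q_s)]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example: two actions, the second has the lower estimated cost.
pi = [0.5, 0.5]
q = [1.0, 0.0]
new_pi = pmd_kl_step(pi, q, eta=1.0)
```

In this toy example the update shifts probability mass toward the lower-cost action while keeping every action's probability positive, which is the multiplicative form the KL divergence induces as the proximal term.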

Policy Optimization with Stochastic Mirror Descent

An Off-Policy Trust Region Policy Optimization Method with Monotonic Improvement Guarantee for Deep Reinforcement Learning

Sample Complexity of Policy Gradient Finding Second-Order Stationary Points

Block Policy Mirror Descent

Variance Reduction based Partial Trajectory Reuse to Accelerate Policy Gradient Optimization

Stochastic Variance-Reduced Policy Gradient

Policy Mirror Descent Inherently Explores Action Space

Reflective Policy Optimization

Policy Gradient for Robust Markov Decision Processes

Stochastic Cubic-Regularized Policy Gradient Method

Variance-Reduced Off-Policy Memory-Efficient Policy Search

Stochastic first-order methods for average-reward Markov decision processes

Proximal Policy Optimization Algorithms

Stochastic Recursive Momentum for Policy Gradient Methods

Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes

Mirror descent method for stochastic multi-objective optimization

Policy Mirror Descent with Lookahead

Variational Delayed Policy Optimization

Trajectory-Oriented Policy Optimization with Sparse Rewards

Sparse Q-learning with Mirror Descent

Discovered Policy Optimisation