Abstract: We consider the problem of solving a robust Markov decision process (MDP), which involves a set of discounted, finite-state, finite-action MDPs with uncertain transition kernels. The goal of planning is to find a robust policy that optimizes the worst-case values against the transition uncertainties, and thus encompasses standard MDP planning as a special case. For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we develop a policy-based first-order method, namely the robust policy mirror descent (RPMD), and establish $\mathcal{O}(\log(1/\epsilon))$ and $\mathcal{O}(1/\epsilon)$ iteration complexities for finding an $\epsilon$-optimal policy, with two increasing-stepsize schemes. These convergence results for RPMD apply to any Bregman divergence, provided the policy space has bounded radius measured by the divergence when centered at the initial policy. Moreover, when the Bregman divergence corresponds to the squared Euclidean distance, we establish an $\mathcal{O}(\max \{1/\epsilon, 1/(\eta \epsilon^2)\})$ complexity of RPMD with any constant stepsize $\eta$. For a general class of Bregman divergences, a similar complexity is also established for RPMD with constant stepsizes, provided the uncertainty set satisfies relative strong convexity. We further develop a stochastic variant, named SRPMD, for the setting where the first-order information is only available through online interactions with the nominal environment. For general Bregman divergences, we establish $\mathcal{O}(1/\epsilon^2)$ and $\mathcal{O}(1/\epsilon^3)$ sample complexities with two increasing-stepsize schemes. For the Euclidean Bregman divergence, we establish an $\mathcal{O}(1/\epsilon^3)$ sample complexity with constant stepsizes. To the best of our knowledge, all the aforementioned results appear to be new for policy-based first-order methods applied to the robust MDP problem.
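To make the shape of a robust policy mirror descent iteration concrete, the following is a minimal sketch and not the paper's algorithm statement. It rests on several assumptions that the abstract does not specify: costs are minimized while the adversary maximizes, each $(\mathbf{s},\mathbf{a})$ pair has a finite list of candidate transition rows (a hypothetical, trivially rectangular uncertainty set), the Bregman divergence is the KL divergence (so the mirror step is a multiplicative-weights update), and the increasing stepsize is taken to be $\eta_k = \eta_0 \gamma^{-k}$. The names `robust_q`, `rpmd`, `cost`, and `kernels` are illustrative only.

```python
import math

def robust_q(policy, cost, kernels, gamma, sweeps=500):
    """Approximate the robust Q-function of a fixed policy by fixed-point iteration.

    kernels[s][a] is a finite list of candidate next-state distributions
    (assumed form of the uncertainty set); the adversary picks the worst one.
    """
    n_s, n_a = len(cost), len(cost[0])
    v = [0.0] * n_s
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(sweeps):
        for s in range(n_s):
            for a in range(n_a):
                # Worst-case expected next-state value over the candidates.
                worst = max(sum(p_t * v_t for p_t, v_t in zip(p, v))
                            for p in kernels[s][a])
                q[s][a] = cost[s][a] + gamma * worst
        v = [sum(pi_a * q_a for pi_a, q_a in zip(policy[s], q[s]))
             for s in range(n_s)]
    return q

def softmax(row):
    m = max(row)
    w = [math.exp(x - m) for x in row]
    z = sum(w)
    return [x / z for x in w]

def rpmd(cost, kernels, gamma, iters=30, eta0=1.0):
    """KL-divergence mirror descent on the robust Q-function (sketch)."""
    n_s, n_a = len(cost), len(cost[0])
    logits = [[0.0] * n_a for _ in range(n_s)]  # uniform initial policy
    policy = [softmax(row) for row in logits]
    for k in range(iters):
        q = robust_q(policy, cost, kernels, gamma)
        eta = eta0 * gamma ** (-k)  # assumed increasing stepsize schedule
        for s in range(n_s):
            # KL mirror step: pi_{k+1}(a|s) proportional to pi_k(a|s) * exp(-eta * Q(s, a)).
            logits[s] = [l - eta * q_sa for l, q_sa in zip(logits[s], q[s])]
            top = max(logits[s])
            logits[s] = [l - top for l in logits[s]]  # keep logits bounded above
        policy = [softmax(row) for row in logits]
    return policy

# Toy instance: 2 states, 2 actions, made-up costs and candidate kernels.
cost = [[1.0, 0.5],
        [0.0, 1.0]]
kernels = [
    [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.3, 0.7]]],  # state 0
    [[[0.0, 1.0], [0.2, 0.8]], [[0.0, 1.0]]],              # state 1
]
policy = rpmd(cost, kernels, gamma=0.9)
```

On this toy instance the iterates should concentrate on action 1 in state 0 and action 0 in state 1, which is the pair that satisfies the robust Bellman optimality condition for these made-up numbers. Keeping the policy in logit form avoids taking the logarithm of a probability that has underflowed to zero once the stepsize grows.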

Policy Gradient in Robust MDPs with Global Convergence Guarantee

Policy Gradient for Robust Markov Decision Processes

A Single-Loop Robust Policy Gradient Method for Robust Markov Decision Processes

Policy Gradient for Rectangular Robust Markov Decision Processes

Convergence Rate of Primal-Dual Approach to Constrained Reinforcement Learning with Softmax Policy

Policy Gradient Algorithms for Robust MDPs with Non-Rectangular Uncertainty Sets

Policy Gradient Method For Robust Reinforcement Learning

Robust Lagrangian and Adversarial Policy Gradient for Robust Constrained Markov Decision Processes

Deterministic Policy Gradients with General State Transitions

First-order Policy Optimization for Robust Markov Decision Process

Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization

Policy Gradient Methods for Risk-Sensitive Distributional Reinforcement Learning with Provable Convergence

Stochastic Cubic-Regularized Policy Gradient Method

Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs

Soft Robust MDPs and Risk-Sensitive MDPs: Equivalence, Policy Gradient, and Sample Complexity

Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs

Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs

Policy Learning for Robust Markov Decision Process with a Mismatched Generative Model

A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees

Policy ensemble gradient for continuous control problems in deep reinforcement learning

Mixed Policy Gradient: off-policy reinforcement learning driven jointly by data and model