Abstract: We develop a generic policy gradient method with a global optimality guarantee for robust Markov Decision Processes (MDPs). While policy gradient methods are widely used for solving dynamic decision problems due to their scalable and efficient nature, adapting these methods to account for model ambiguity has been challenging, often making it impractical to learn robust policies. This paper introduces a novel policy gradient method, Double-Loop Robust Policy Mirror Descent (DRPMD), for solving robust MDPs. DRPMD employs a general mirror descent update rule for policy optimization with an adaptive tolerance per iteration, guaranteeing convergence to a globally optimal policy. We provide a comprehensive analysis of DRPMD, including new convergence results under both direct and softmax parameterizations, and provide novel insights into the inner problem solution through Transition Mirror Ascent (TMA). Additionally, we propose innovative parametric transition kernels for both discrete and continuous state-action spaces, broadening the applicability of our approach. Empirical results validate the robustness and global convergence of DRPMD across various challenging robust MDP settings.
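
For orientation, the min-max problem and the two nested steps that the abstract describes are commonly written as below. This is a sketch under assumptions: the cost-minimization convention and the symbols $J_\rho$, $\mathcal{P}$, $\epsilon_t$, $\eta_t$, and $D$ are introduced here for illustration and are not taken from the abstract, so the paper's exact formulation may differ.

$$
\begin{aligned}
&\text{Robust objective:} && \min_{\pi \in \Pi} \; \max_{p \in \mathcal{P}} \; J_\rho(\pi, p),
\qquad J_\rho(\pi, p) = \mathbb{E}_{s_0 \sim \rho,\, \pi,\, p}\Big[\sum_{k \ge 0} \gamma^k c(s_k, a_k)\Big] \\
&\text{Inner step (tolerance } \epsilon_t\text{):} && \text{find } p_t \in \mathcal{P} \text{ such that } J_\rho(\pi_t, p_t) \ge \max_{p \in \mathcal{P}} J_\rho(\pi_t, p) - \epsilon_t \\
&\text{Outer step (mirror descent):} && \pi_{t+1} \in \arg\min_{\pi \in \Pi} \Big\{ \eta_t \big\langle \nabla_\pi J_\rho(\pi_t, p_t), \pi \big\rangle + D(\pi, \pi_t) \Big\}
\end{aligned}
$$

Here $D$ stands for a Bregman divergence (for example the KL divergence), $\eta_t$ for a step size, and $\epsilon_t$ for the per-iteration tolerance, which is what allows the inner problem to be solved only approximately.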

What problem does this paper attempt to address?

The core problem that this paper attempts to solve is developing a policy gradient method with a global optimality guarantee for Robust Markov Decision Processes (RMDPs). Specifically, the paper aims to overcome the limitations of existing RMDP methods in dealing with model uncertainty. In particular, when there are estimation errors in environmental parameters (such as transition probabilities), traditional methods may fail to learn robust policies. To this end, the authors propose a new policy gradient algorithm, Double-Loop Robust Policy Mirror Descent (DRPMD), prove its convergence, and validate its effectiveness empirically.

### Main problems and challenges

1. **Model uncertainty**: In practical applications, such as in the financial and medical fields, model parameters are usually estimated from noisy and limited data. The resulting estimation errors can make policies learned from these parameters perform poorly in practice.
2. **Robustness requirements**: To address this, it is necessary to find policies that are robust to parameter errors, that is, policies that perform well even in worst-case scenarios.
3. **Computational complexity**: Traditional robust optimization methods have high computational costs on large-scale or continuous state spaces and are difficult to solve efficiently.

### Solutions

The paper proposes a new policy gradient method, DRPMD, which addresses the above problems in the following ways:

- **Double-loop structure**: The outer loop updates the policy parameters, and the inner loop approximately solves for the worst-case transition probabilities. This structure can effectively handle non-convex optimization problems and ensure global convergence.
- **Adaptive tolerance**: An adaptive tolerance sequence is introduced to reduce the computational cost of the inner-loop problem while preserving the convergence of the algorithm.
- **Applicable to multiple parameterized policies**: DRPMD can be applied to different policy parameterization forms, such as softmax parameterization and Gaussian parameterization, enhancing the flexibility and applicability of the algorithm.

### Key contributions

1. **Global optimality guarantee**: DRPMD is the first policy gradient method using softmax parameterization in robust MDPs, providing a global optimality guarantee.
2. **Fast convergence rate**: For a specific type of robust MDP (such as s-rectangular robust MDPs), DRPMD achieves a faster convergence rate.
3. **Innovative inner-loop solution method**: Transition Mirror Ascent (TMA) and its stochastic variant MCTMA are proposed for efficiently solving the inner-loop maximization problem.
4. **Extended parameterization forms**: Two new transition parameterization methods are introduced, improving the scalability of the algorithm in large-scale and continuous state spaces.

In summary, by proposing the DRPMD algorithm, this paper provides an effective solution that can learn robust and globally optimal policies in the presence of model uncertainty.
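
The double-loop structure and adaptive tolerance described under Solutions can be illustrated with a small tabular sketch. This is not the paper's implementation, and several pieces are assumptions made for illustration: the two-state toy problem and all of its numbers are hypothetical, the uncertainty set is a finite (s,a)-rectangular list of candidate kernels, the inner loop uses fixed-point iteration on the robust policy-evaluation operator instead of TMA, the outer step uses the KL (multiplicative-weights) form of mirror descent, and the tolerance schedule shrinks by a factor of `GAMMA` each iteration.

```python
import math

GAMMA = 0.9
# Hypothetical toy problem: states 0 (good), 1 (bad); actions 0 (risky), 1 (safe).
STATES, ACTIONS = (0, 1), (0, 1)
COST = [[0.0, 0.3], [1.0, 1.3]]  # COST[s][a]
# Assumed (s,a)-rectangular uncertainty set: candidate next-state
# distributions [P(good), P(bad)] per action, shared by both states.
CANDIDATES = {
    0: [[0.7, 0.3], [0.5, 0.5]],    # risky: nominal, adversarial
    1: [[0.95, 0.05], [0.9, 0.1]],  # safe: nominal, adversarial
}


def worst_case_q(v):
    """Q[s][a] = cost + GAMMA * (max over candidate kernels of E[v(s')])."""
    q = []
    for s in STATES:
        row = []
        for a in ACTIONS:
            worst = max(sum(p * x for p, x in zip(kernel, v))
                        for kernel in CANDIDATES[a])
            row.append(COST[s][a] + GAMMA * worst)
        q.append(row)
    return q


def inner_loop(policy, v, tol, max_iter=10_000):
    """Inner loop: worst-case value of `policy`, accurate to `tol` in sup norm.

    Stand-in for TMA: iterates the robust evaluation operator, a
    GAMMA-contraction, from the warm start `v`.
    """
    for _ in range(max_iter):
        q = worst_case_q(v)
        v_new = [sum(policy[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
        change = max(abs(x - y) for x, y in zip(v_new, v))
        v = v_new
        if change <= tol * (1 - GAMMA) / GAMMA:  # implies error <= tol
            break
    return v


def drpmd_sketch(num_iters=200, step=1.0, tol0=0.1):
    """Outer loop: KL mirror descent on the policy against worst-case Q-values."""
    policy = [[0.5, 0.5], [0.5, 0.5]]
    v, tol = [0.0, 0.0], tol0
    for _ in range(num_iters):
        v = inner_loop(policy, v, tol)       # approximate worst case
        q = worst_case_q(v)
        for s in STATES:
            base = min(q[s])                 # shift for numerical stability
            w = [policy[s][a] * math.exp(-step * (q[s][a] - base))
                 for a in ACTIONS]
            total = sum(w)
            policy[s] = [x / total for x in w]
        tol = max(GAMMA * tol, 1e-9)         # assumed adaptive tolerance schedule
    return policy, inner_loop(policy, v, 1e-9)


if __name__ == "__main__":
    final_policy, robust_value = drpmd_sketch()
    print("policy:", [[round(x, 4) for x in row] for row in final_policy])
    print("worst-case value:", [round(x, 4) for x in robust_value])
```

With these made-up numbers, a hand calculation gives the risky action a lower cost under the nominal kernels (2.7 versus 3.45 from the good state) but a higher cost under the worst-case kernels (4.5 versus 3.9), so the sketch should end up concentrating on the safe action. Loosening the tolerance in early iterations keeps the inner loop cheap while the policy is still far from optimal, which is the role the summary attributes to the adaptive tolerance sequence.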
