Abstract: We develop a generic policy gradient method with the global optimality guarantee for robust Markov Decision Processes (MDPs). While policy gradient methods are widely used for solving dynamic decision problems due to their scalable and efficient nature, adapting these methods to account for model ambiguity has been challenging, often making it impractical to learn robust policies. This paper introduces a novel policy gradient method, Double-Loop Robust Policy Mirror Descent (DRPMD), for solving robust MDPs. DRPMD employs a general mirror descent update rule for the policy optimization with adaptive tolerance per iteration, guaranteeing convergence to a globally optimal policy. We provide a comprehensive analysis of DRPMD, including new convergence results under both direct and softmax parameterizations, and provide novel insights into the inner problem solution through Transition Mirror Ascent (TMA). Additionally, we propose innovative parametric transition kernels for both discrete and continuous state-action spaces, broadening the applicability of our approach. Empirical results validate the robustness and global convergence of DRPMD across various challenging robust MDP settings.
What problem does this paper attempt to address?
The paper sets out to develop a policy gradient method with a global optimality guarantee for robust Markov Decision Processes (RMDPs). Its target is the difficulty that existing policy gradient methods have with model uncertainty: when environmental parameters such as transition probabilities carry estimation errors, these methods may fail to learn robust policies. To address this, the authors propose a new policy gradient algorithm, Double-Loop Robust Policy Mirror Descent (DRPMD), prove its convergence, and validate it empirically.
### Main problems and challenges
1. **Model uncertainty**: In domains such as finance and medicine, model parameters are usually estimated from noisy and limited data. The resulting estimation errors can make policies learned from those parameters perform poorly once deployed.
2. **Robustness requirements**: To address this, one needs policies that are robust to parameter errors, that is, policies that still perform well in the worst-case scenario.
3. **Computational complexity**: Traditional robust optimization methods are computationally expensive for large-scale or continuous state spaces and are difficult to solve efficiently.
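
The robustness requirement above is commonly written as a min-max problem. The formulation below is a sketch under assumed notation, none of which appears in the summary itself: $\Pi$ is the policy class, $\mathcal{P}$ the ambiguity set of transition kernels, $c$ a cost function, $\gamma \in (0,1)$ a discount factor, and $\rho$ an initial state distribution. It also assumes a cost-minimization convention; with rewards, the roles of min and max swap.

$$
\min_{\pi \in \Pi} \; \max_{p \in \mathcal{P}} \; J_\rho(\pi, p),
\qquad
J_\rho(\pi, p) = \mathbb{E}^{\pi, p}_{s_0 \sim \rho}\left[ \sum_{t=0}^{\infty} \gamma^{t} \, c(s_t, a_t) \right]
$$

The inner maximization picks the worst-case transition kernel for a given policy, and the outer minimization picks the policy whose worst case is least costly.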
### Solutions
The paper proposes a new policy gradient method, DRPMD, which solves the above problems in the following ways:
- **Double-loop structure**: The outer loop updates the policy parameters, and the inner loop approximately solves for the worst-case transition probabilities. This structure can effectively handle non-convex optimization problems and ensure global convergence.
- **Adaptive tolerance**: An adaptive tolerance sequence is introduced to reduce the computational cost of the inner-loop problem while ensuring the convergence of the algorithm.
- **Applicable to multiple parameterized policies**: DRPMD can be applied to different policy parameterization forms, such as softmax parameterization, Gaussian parameterization, etc., enhancing the flexibility and applicability of the algorithm.
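
The double-loop idea can be illustrated on a small tabular problem. The following is a minimal sketch, not the paper's algorithm: it assumes a hypothetical finite ambiguity set (a list of candidate kernels), solves the inner problem exactly by enumeration (so the adaptive tolerance is not modelled), and uses a KL-based mirror descent step on a tabular policy in the outer loop. The names `evaluate` and `double_loop_sketch` and all parameters are illustrative assumptions.

```python
import numpy as np


def evaluate(P, c, pi, gamma, rho):
    """Return (expected discounted cost, Q-values) of policy pi under kernel P.

    P: (S, A, S) transition kernel, c: (S, A) costs, pi: (S, A) policy,
    rho: (S,) initial state distribution. All names are illustrative.
    """
    n_states = c.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    c_pi = (pi * c).sum(axis=1)             # expected one-step cost per state
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
    q = c + gamma * (P @ v)                 # shape (S, A)
    return float(rho @ v), q


def double_loop_sketch(kernels, c, gamma, rho, eta=1.0, n_iters=200):
    """Outer KL mirror descent on the policy, inner worst case by enumeration."""
    logits = np.zeros_like(c, dtype=float)
    pi = np.full(c.shape, 1.0 / c.shape[1])  # uniform initial policy
    for _ in range(n_iters):
        # Inner loop (simplified assumption): exact worst-case kernel
        # for the current policy, found by scanning the finite set.
        evals = [evaluate(P, c, pi, gamma, rho) for P in kernels]
        _, q_worst = max(evals, key=lambda e: e[0])
        # Outer loop: mirror descent step with the KL divergence, i.e.
        # pi_new(a|s) proportional to pi(a|s) * exp(-eta * Q(s, a)).
        logits = logits - eta * q_worst
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
    robust_cost = max(evaluate(P, c, pi, gamma, rho)[0] for P in kernels)
    return pi, robust_cost
```

Keeping the policy in logit form avoids taking the logarithm of probabilities that underflow to zero. A finite set of whole kernels is generally not rectangular, so this toy setup should not be expected to inherit the convergence guarantees described in the summary.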
### Key contributions
1. **Global optimality guarantee**: DRPMD is the first policy gradient method using softmax parameterization in robust MDPs, providing a global optimality guarantee.
2. **Fast convergence rate**: For specific classes of robust MDPs, such as s-rectangular robust MDPs, DRPMD achieves a faster convergence rate.
3. **Innovative inner-loop solution method**: Transition Mirror Ascent (TMA) and its stochastic variant MCTMA are proposed for efficiently solving the inner-loop maximization problem.
4. **Extended parameterization forms**: Two new transition parameterization methods are introduced, improving the scalability of the algorithm in large-scale and continuous state spaces.
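
To give a feel for what a mirror ascent step on the transition kernel might look like, here is a hedged sketch. It is not the paper's TMA: it assumes a tabular kernel, uses the KL divergence as the mirror map (which yields an exponentiated-gradient update), and ignores the ambiguity-set constraint entirely, so each row is only kept on the probability simplex. The function names and arguments are hypothetical.

```python
import numpy as np


def cost_and_transition_gradient(P, c, pi, gamma, rho):
    """Discounted cost of pi under P and its gradient with respect to P.

    Treating the entries of P as free variables,
    dJ/dP[s, a, s'] = gamma * mu(s) * pi(a|s) * v(s'),
    where mu is the unnormalised discounted state occupancy.
    """
    n_states = c.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)
    c_pi = (pi * c).sum(axis=1)
    M = np.eye(n_states) - gamma * P_pi
    v = np.linalg.solve(M, c_pi)       # value function
    mu = np.linalg.solve(M.T, rho)     # discounted occupancy (sums to 1/(1-gamma))
    grad = gamma * mu[:, None, None] * pi[:, :, None] * v[None, None, :]
    return float(rho @ v), grad


def mirror_ascent_step(P, grad, eta):
    """KL mirror ascent: P_new(.|s,a) proportional to P(.|s,a) * exp(eta * grad).

    Assumption: no ambiguity-set constraint is enforced here; a real
    robust MDP solver would also need to keep P inside that set.
    """
    logits = np.log(np.maximum(P, 1e-300)) + eta * grad
    logits -= logits.max(axis=2, keepdims=True)   # numerical stability
    new_P = np.exp(logits)
    return new_P / new_P.sum(axis=2, keepdims=True)
```

Repeating `cost_and_transition_gradient` followed by `mirror_ascent_step` shifts probability mass toward high-value (costly) next states, which is the adversary's goal in the inner maximization.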
In summary, by proposing the DRPMD algorithm, this paper provides an effective solution that can learn robust and globally optimal policies in the presence of model uncertainty.