Abstract: This paper studies the computation of robust deterministic policies for Markov Decision Processes (MDPs) in the Lightning Does Not Strike Twice (LDST) model of Mannor, Mebel and Xu (ICML '12). In this model, designed to provide robustness in the face of uncertain input data while not being overly conservative, transition probabilities and rewards are uncertain and the uncertainty set is constrained by a budget that limits the number of states whose parameters can deviate from their nominal values. Mannor et al. (ICML '12) showed that optimal randomized policies for MDPs in the LDST regime can be efficiently computed when only the rewards are affected by uncertainty. In contrast to these findings, we observe that the computation of optimal deterministic policies is $N\!P$-hard even when only a single terminal reward may deviate from its nominal value and the MDP consists of $2$ time periods. For this hard special case, we then derive a constant-factor approximation algorithm by combining two relaxations based on the Knapsack Cover and Generalized Assignment problems, respectively. For the general problem with possibly a large number of deviations and a longer time horizon, we derive strong inapproximability results for computing robust deterministic policies as well as $\Sigma_2^p$-hardness, indicating that the general problem does not even admit a compact mixed integer programming formulation.

What problem does this paper attempt to address?

This paper addresses the computation of robust deterministic policies for Markov Decision Processes (MDPs) under the Budgeted Uncertainty framework. Specifically, it studies how to handle uncertain parameters (such as transition probabilities and rewards) in the Lightning Does Not Strike Twice (LDST) model so that the resulting policy still performs well in the worst case.

### Main Problem Description

1. **Background and Motivation**:
   - In many practical applications, important MDP parameters (such as rewards or transition probabilities) are difficult to estimate accurately because data is limited and may be noisy.
   - If this uncertainty is ignored when optimizing a policy, the quality of the solution may degrade significantly.
   - Robust MDP models were proposed to deal with this uncertainty by optimizing a policy against the worst-case scenario.

2. **Limitations of Existing Models**:
   - The Rectangular Uncertainty Set assumption makes computing an optimal robust policy tractable, but it may lead to overly conservative solutions because it allows the worst case to occur in every state simultaneously.
   - Mannor, Mebel and Xu introduced the LDST model, which reduces conservatism by imposing a budget on the number of states whose parameters can deviate from their nominal values.

3. **Core Problems of the Research**:
   - The paper studies the complexity of computing an optimal deterministic policy in the LDST model.
   - Even in a simple MDP with two time periods, where only a single terminal reward may deviate from its nominal value, computing an optimal deterministic policy is NP-hard.
   - In the general case, computing robust deterministic policies is $\Sigma_2^p$-hard, which means that, under standard complexity assumptions, the problem admits no compact mixed-integer programming formulation.

4. **Contributions and Results**:
   - For the hard two-period special case, the paper gives a constant-factor approximation algorithm that combines two relaxations, based on the Knapsack Cover and Generalized Assignment problems.
   - For the general problem, the paper shows strong inapproximability results, indicating that no bounded approximation guarantee can be obtained in polynomial time.

### Formula Representation

- **Worst-Case Reward**:
  \[
  \hat{R}(\pi) := \min_{\hat{p} \in \hat{P}} \mathbb{E}\left[r(s^\pi_T)\right] = \min_{\hat{p} \in \hat{P}} \sum_{s \in S_T} \Pr\left[s^\pi_T = s\right] r(s)
  \]
  where $s^\pi_0, \ldots, s^\pi_T$ is the random state sequence induced by policy $\pi$ under the alternative transition kernel $\hat{p}$.

- **Loss Definition**:
  \[
  L(\pi) := R(\pi) - \hat{R}(\pi)
  \]
  which is the amount by which the reward of policy $\pi$ drops from its nominal value $R(\pi)$ to its worst-case value $\hat{R}(\pi)$.

### Conclusion

Through theoretical analysis and algorithm design, the paper characterizes the complexity of computing robust deterministic policies in the LDST model and provides an approximation algorithm for the two-period special case. These results give a theoretical basis for handling parameter uncertainty in sequential decision problems.
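To make the definitions of $\hat{R}(\pi)$ and $L(\pi)$ concrete, the following is a minimal sketch that evaluates a fixed deterministic policy by brute force on a toy instance. It is not the paper's algorithm. The toy MDP, the function names, and the simplification that each state has exactly one alternative kernel are all assumptions made for illustration; the budget caps how many states may switch from the nominal to the alternative kernel.

```python
from itertools import combinations

def expected_terminal_reward(policy, kernel, rewards, start, horizon):
    """Propagate the state distribution forward for `horizon` steps
    and return E[r(s_T)] under the given transition kernel."""
    dist = {start: 1.0}
    for _ in range(horizon):
        nxt = {}
        for s, prob in dist.items():
            for s2, q in kernel[s][policy[s]].items():
                nxt[s2] = nxt.get(s2, 0.0) + prob * q
        dist = nxt
    return sum(prob * rewards[s] for s, prob in dist.items())

def worst_case_reward(policy, nominal, alternative, rewards,
                      start, horizon, budget):
    """Minimize the expected terminal reward over all kernels that
    deviate from `nominal` in at most `budget` states.
    Enumerates subsets, so the running time is exponential in `budget`."""
    worst = float("inf")
    for k in range(budget + 1):
        for subset in combinations(list(alternative), k):
            kernel = dict(nominal)          # shallow copy; nominal is untouched
            for s in subset:
                kernel[s] = alternative[s]  # state s deviates
            value = expected_terminal_reward(
                policy, kernel, rewards, start, horizon)
            worst = min(worst, value)
    return worst

# Hypothetical toy instance: s0 -> {a, b} -> {hi, lo}, two time periods.
nominal = {
    "s0": {"go": {"a": 0.5, "b": 0.5}},
    "a":  {"go": {"hi": 1.0}},
    "b":  {"go": {"hi": 0.5, "lo": 0.5}},
}
alternative = {                 # assumed single deviation per state
    "a": {"go": {"lo": 1.0}},
    "b": {"go": {"lo": 1.0}},
}
rewards = {"hi": 4.0, "lo": 0.0}
policy = {"s0": "go", "a": "go", "b": "go"}

R = expected_terminal_reward(policy, nominal, rewards, "s0", 2)
R_hat = worst_case_reward(policy, nominal, alternative, rewards, "s0", 2, budget=1)
L = R - R_hat
print(R, R_hat, L)  # 3.0 1.0 2.0
```

With a budget of one deviating state, the adversary's best move in this toy instance is to corrupt state `a`, which lowers the expected reward from 3.0 to 1.0. The sketch only evaluates one given policy; the hardness results summarized above concern optimizing over deterministic policies.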

Robust Deterministic Policies for Markov Decision Processes under Budgeted Uncertainty
