Abstract: Robust Markov decision processes (MDPs) provide a general framework for modeling decision problems in which the system dynamics are changing or only partially known. Efficient methods exist for some \texttt{sa}-rectangular robust MDPs; they exploit the equivalence with reward-regularized MDPs and generalize to online settings. Compared with \texttt{sa}-rectangular robust MDPs, \texttt{s}-rectangular robust MDPs are less restrictive but much harder to handle. Interestingly, recent works have established an equivalence between \texttt{s}-rectangular robust MDPs and policy-regularized MDPs. However, it is not yet clear how to exploit this equivalence to perform policy improvement steps that lead to the optimal value function or policy. Nor is the greedy/optimal policy well understood, beyond the fact that it can be stochastic. No existing method generalizes naturally to model-free settings. We show a clear and explicit equivalence between \texttt{s}-rectangular $L_p$ robust MDPs and policy-regularized MDPs that closely resemble the policy-entropy-regularized MDPs widely used in practice. Further, we examine the policy improvement step and derive concrete optimal robust Bellman operators for \texttt{s}-rectangular $L_p$ robust MDPs. We find that the greedy/optimal policies in \texttt{s}-rectangular $L_p$ robust MDPs are threshold policies: they play the top $k$ actions, namely those whose $Q$-value exceeds some threshold value, each with probability proportional to the $(p-1)$th power of its advantage. In addition, we show that the time complexity of (\texttt{sa}- and \texttt{s}-rectangular) $L_p$ robust MDPs is the same as that of non-robust MDPs up to logarithmic factors. Our work greatly extends the existing understanding of \texttt{s}-rectangular robust MDPs and is naturally generalizable to online settings.
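The threshold policy described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the abstract: the threshold is supplied as an input (the abstract does not say how it is computed), the "advantage" of an action is taken to be its $Q$-value minus that threshold, and $p > 1$. The function name `threshold_policy` and its parameters are hypothetical.

```python
from typing import List


def threshold_policy(q_values: List[float], threshold: float, p: float) -> List[float]:
    """Illustrative threshold policy for a single state.

    Assumptions (not from the abstract): `threshold` is given, the
    advantage of an action is q - threshold, and p > 1.
    """
    if p <= 1:
        raise ValueError("this sketch assumes p > 1")
    # Actions at or below the threshold get weight 0; the others get
    # weight advantage ** (p - 1).
    weights = [max(q - threshold, 0.0) ** (p - 1) for q in q_values]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("threshold must lie strictly below the largest Q-value")
    # Normalize so the weights form a probability distribution.
    return [w / total for w in weights]


if __name__ == "__main__":
    # With p = 2 the probabilities are linear in the advantage.
    print(threshold_policy([1.0, 2.0, 4.0], threshold=1.5, p=2.0))
```

In this sketch, an action whose $Q$-value falls below the threshold receives probability zero, and the remaining actions share the probability mass according to the $(p-1)$th power of their advantage.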
