Abstract: In this paper, we study distributionally robust offline reinforcement learning (robust offline RL), which seeks to find, purely from an offline dataset, an optimal policy that performs well in perturbed environments. Specifically, we propose a generic algorithm framework called Doubly Pessimistic Model-based Policy Optimization ($P^2MPO$), which features a novel combination of a flexible model estimation subroutine and a doubly pessimistic policy optimization step. Notably, the double pessimism principle is crucial for overcoming the distributional shifts incurred by (i) the mismatch between the behavior policy and the target policies; and (ii) the perturbation of the nominal model. Under certain accuracy conditions on the model estimation subroutine, we prove that $P^2MPO$ is sample-efficient with robust partial coverage data, a condition that only requires the offline data to have good coverage of the distributions induced by the optimal robust policy and the perturbed models around the nominal model. By tailoring specific model estimation subroutines to concrete examples of robust Markov decision processes (RMDPs), including tabular RMDPs, factored RMDPs, and kernel and neural RMDPs, we prove that $P^2MPO$ enjoys a $\tilde{\mathcal{O}}(n^{-1/2})$ convergence rate, where $n$ is the dataset size. We highlight that all these examples, except tabular RMDPs, are first identified and proven tractable by this work. Furthermore, we extend our study of robust offline RL to robust Markov games (RMGs). By extending the double pessimism principle identified for single-agent RMDPs, we propose another algorithm framework that can efficiently find the robust Nash equilibria among players using only robust unilateral (partial) coverage data. To the best of our knowledge, this work proposes the first general learning principle, double pessimism, for robust offline RL and shows that it is provably efficient with general function approximation.
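
As a rough illustration of the double pessimism principle summarized in the abstract, the policy optimization step can be sketched as a nested worst-case problem. The notation here is assumed for illustration and is not taken from the abstract: $\hat{\mathcal{P}}$ is a confidence region of candidate nominal models returned by the model estimation subroutine, $\Phi(P)$ is the set of perturbed models around a nominal model $P$, and $V^{\pi}_{\tilde{P}}$ is the value of policy $\pi$ under model $\tilde{P}$.

$$
\hat{\pi} \in \arg\max_{\pi} \; \inf_{P \in \hat{\mathcal{P}}} \; \inf_{\tilde{P} \in \Phi(P)} V^{\pi}_{\tilde{P}}
$$

Under these assumptions, the outer infimum is pessimistic about which nominal model is consistent with the offline data, which corresponds to shift (i), and the inner infimum is pessimistic about perturbations of that nominal model, which corresponds to shift (ii).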

Efficient Duple Perturbation Robustness in Low-rank MDPs

Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage

Time-Constrained Robust MDPs

Policy Learning for Robust Markov Decision Process with a Mismatched Generative Model

Distributionally robust optimization for sequential decision-making

Online Policy Optimization for Robust MDP

Robust Offline Reinforcement Learning for Non-Markovian Decision Processes

Robust Deep Reinforcement Learning with Adaptive Adversarial Perturbations in Action Space

Optimizing Norm-Bounded Weighted Ambiguity Sets for Robust MDPs

Rectangularity and duality of distributionally robust Markov Decision Processes

The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model

Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes

Sequential Decision-Making under Uncertainty: A Robust MDPs review

Upper and Lower Bounds for Distributionally Robust Off-Dynamics Reinforcement Learning

Policy Gradient Algorithms for Robust MDPs with Non-Rectangular Uncertainty Sets

Beyond Confidence Regions: Tight Bayesian Ambiguity Sets for Robust MDPs

Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization

Sample Complexity of Offline Distributionally Robust Linear Markov Decision Processes

Robust Anytime Learning of Markov Decision Processes

Solving Robust MDPs through No-Regret Dynamics

Robust Reinforcement Learning for Continuous Control with Model Misspecification