Abstract: In this paper, we study distributionally robust offline reinforcement learning (robust offline RL), which seeks to learn, purely from an offline dataset, an optimal policy that performs well in perturbed environments. Specifically, we propose a generic algorithm framework called Doubly Pessimistic Model-based Policy Optimization ($P^2MPO$), which combines a flexible model estimation subroutine with a doubly pessimistic policy optimization step. Notably, the double pessimism principle is crucial for overcoming the distributional shifts incurred by (i) the mismatch between the behavior policy and the target policies; and (ii) the perturbation of the nominal model. Under certain accuracy conditions on the model estimation subroutine, we prove that $P^2MPO$ is sample-efficient with robust partial coverage data, a condition that only requires the offline data to have good coverage of the distributions induced by the optimal robust policy and the perturbed models around the nominal model. By tailoring specific model estimation subroutines to concrete examples of robust Markov decision processes (RMDPs), including tabular RMDPs, factored RMDPs, and kernel and neural RMDPs, we prove that $P^2MPO$ enjoys a $\tilde{\mathcal{O}}(n^{-1/2})$ convergence rate, where $n$ is the dataset size. We highlight that all these examples, except tabular RMDPs, are first identified and proven tractable by this work. Furthermore, we extend our study of robust offline RL to robust Markov games (RMGs). By extending the double pessimism principle identified for single-agent RMDPs, we propose another algorithm framework that can efficiently find the robust Nash equilibria among players using only robust unilateral (partial) coverage data. To the best of our knowledge, this work proposes the first general learning principle -- double pessimism -- for robust offline RL and shows that it is provably efficient with general function approximation.

Pessimism for Offline Linear Contextual Bandits using $\ell_p$ Confidence Sets

Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism

Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning

Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach

Is Pessimism Provably Efficient for Offline Reinforcement Learning?

Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage

Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality

Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes

Upper Counterfactual Confidence Bounds: a New Optimism Principle for Contextual Bandits

Optimal Baseline Corrections for Off-Policy Contextual Bandits

Bayesian Regret Minimization in Offline Bandits

Pessimistic Off-Policy Optimization for Learning to Rank

LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits

Unified PAC-Bayesian Study of Pessimism for Offline Policy Learning with Regularized Importance Sampling

On the Optimal Regret of Locally Private Linear Contextual Bandit

Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits

Contextual Continuum Bandits: Static Versus Dynamic Regret

Optimistic Information Directed Sampling

Stochastic Conservative Contextual Linear Bandits

Robust Contextual Linear Bandits