Abstract: In offline reinforcement learning (RL), the absence of active exploration calls for attention on the model robustness to tackle the sim-to-real gap, where the discrepancy between the simulated and deployed environments can significantly undermine the performance of the learned policy. To endow the learned policy with robustness in a sample-efficient manner in the presence of high-dimensional state-action space, this paper considers the sample complexity of distributionally robust linear Markov decision processes (MDPs) with an uncertainty set characterized by the total variation distance using offline data. We develop a pessimistic model-based algorithm and establish its sample complexity bound under minimal data coverage assumptions, which outperforms prior art by at least $\widetilde{O}(d)$, where $d$ is the feature dimension. We further improve the performance guarantee of the proposed algorithm by incorporating a carefully-designed variance estimator.

What problem does this paper attempt to address?

The paper addresses the sample complexity of offline distributionally robust linear Markov decision processes (Lin-RMDPs): how to learn, in a sample-efficient manner, a policy that is robust to environmental changes in a high-dimensional state-action space. The problem breaks down as follows:

1. **Limitations of Offline Reinforcement Learning**: Offline RL has no active exploration, so model robustness becomes particularly important. When the simulated environment differs from the deployment environment (the sim-to-real gap), the discrepancy can significantly degrade the performance of the learned policy.
2. **The Need for Distributional Robustness**: For the learned policy to perform well under environmental uncertainty, distributionally robust offline RL algorithms are needed. These algorithms optimize worst-case performance: the policy should still work well for every environment within a predefined uncertainty set.
3. **The Challenge of Sample Complexity**: Most existing provable distributionally robust offline RL algorithms apply only to tabular settings with finite state and action spaces. Their sample complexity scales linearly with the size of the state-action space, which is prohibitive in high-dimensional problems. A sample-efficient algorithm is therefore needed for distributionally robust offline RL under linear representation.
4. **Main Research Question**: Can a provably sample-efficient algorithm be designed for distributionally robust offline RL with linear representation?
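To make the worst-case objective in item 2 concrete, the following minimal sketch computes the smallest expected value an adversary can induce when it may perturb a nominal distribution within a total variation ball of radius `rho`. It is an illustration only: the function name `tv_worst_case_expectation`, the convention that total variation is half the L1 distance, and the toy numbers are all assumptions, and the paper's uncertainty set for linear MDPs may be structured differently from this tabular example.

```python
def tv_worst_case_expectation(p, values, rho):
    """Minimize sum(q[i] * values[i]) over distributions q with TV(q, p) <= rho.

    Assumes TV(q, p) = 0.5 * sum(|q[i] - p[i]|) and that the adversary may
    place mass on any of the listed states.
    """
    q = list(p)
    worst = min(range(len(values)), key=lambda i: values[i])
    # At most rho mass can be moved, and no more than sits outside the worst state.
    budget = min(rho, 1.0 - q[worst])
    # Take mass from the highest-value states first and move it to the worst state.
    for i in sorted(range(len(values)), key=lambda i: values[i], reverse=True):
        if budget <= 0 or i == worst:
            continue
        moved = min(q[i], budget)
        q[i] -= moved
        q[worst] += moved
        budget -= moved
    return sum(qi * vi for qi, vi in zip(q, values))


# Hypothetical three-state example.
p = [0.5, 0.3, 0.2]
values = [1.0, 2.0, 4.0]
nominal = tv_worst_case_expectation(p, values, 0.0)  # about 1.9
robust = tv_worst_case_expectation(p, values, 0.1)   # about 1.6
```

With `rho = 0.1`, the adversary shifts 0.1 of probability mass from the state with value 4.0 to the state with value 1.0, which lowers the expectation from 1.9 to 1.6. A robust policy is evaluated against this kind of pessimistic expectation rather than the nominal one.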
### Overview of the Solution

To answer this question, the paper proposes an algorithm called Distributionally Robust Pessimistic Least-Squares Value Iteration (DROP) and analyzes its performance theoretically. The specific contributions are as follows:

- **Algorithm Design**: DROP is a distributionally robust pessimistic least-squares value iteration algorithm based on linear representation. It introduces a data-driven penalty function to handle data scarcity in the offline setting.
- **Theoretical Guarantees**: A sub-optimality bound is established under minimal offline data coverage assumptions, and a new concentration coefficient $C^{\star}_{\text{rob}}$ is introduced to characterize the partial feature coverage of the offline data.
- **Improved Sample Complexity**: Compared with existing methods, DROP improves the sample complexity by at least $\widetilde{O}(d)$ under partial feature coverage, where $d$ is the feature dimension.
- **Variance-Weighted Variant**: A variance-weighted variant, DROP-V, tightens the sub-optimality bound under the full feature coverage assumption by controlling the variance more closely.

In summary, by proposing DROP and its variant, the paper addresses the sample complexity of distributionally robust offline RL in high-dimensional state-action spaces and provides theoretical performance guarantees.
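The idea of a data-driven penalty can be illustrated in the simplest special case. In pessimistic value iteration for linear MDPs, a commonly used penalty is the elliptical bonus $\beta \sqrt{\phi(s,a)^{\top} \Lambda^{-1} \phi(s,a)}$ with $\Lambda = \sum_{i} \phi(s_i,a_i)\phi(s_i,a_i)^{\top} + \lambda I$. With one-hot features this reduces to $\beta / \sqrt{N(s,a) + \lambda}$, where $N(s,a)$ is the number of times the pair appears in the dataset. The sketch below implements only this one-hot case. The helper name `make_count_penalty`, the values of `beta` and `lam`, and the toy dataset are assumptions, and the exact penalty used by DROP may differ.

```python
import math
from collections import Counter


def make_count_penalty(dataset, beta=1.0, lam=1.0):
    """Return penalty(s, a) = beta / sqrt(N(s, a) + lam).

    This is the elliptical bonus specialised to one-hot features, where the
    regularised covariance matrix is diagonal with entries N(s, a) + lam.
    """
    counts = Counter((s, a) for s, a, _reward, _next_state in dataset)

    def penalty(s, a):
        # Counter returns 0 for unseen pairs, so they get the largest penalty.
        return beta / math.sqrt(counts[(s, a)] + lam)

    return penalty


# Hypothetical offline dataset of (state, action, reward, next_state) tuples.
dataset = [
    ("s0", "a0", 1.0, "s1"),
    ("s0", "a0", 0.0, "s0"),
    ("s0", "a0", 1.0, "s1"),
    ("s1", "a1", 0.5, "s0"),
]
penalty = make_count_penalty(dataset)
well_covered = penalty("s0", "a0")  # 3 visits: 1 / sqrt(4) = 0.5
unseen = penalty("s1", "a0")        # 0 visits: 1 / sqrt(1) = 1.0
```

Subtracting such a penalty from the estimated value makes the learner pessimistic about state-action pairs that the offline data covers poorly, which is the role the penalty plays when exploration is not possible.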

Sample Complexity of Offline Distributionally Robust Linear Markov Decision Processes
