Sample Complexity of Offline Distributionally Robust Linear Markov Decision Processes

He Wang, Laixi Shi, Yuejie Chi
2024-06-27
Abstract: In offline reinforcement learning (RL), the absence of active exploration calls for attention on the model robustness to tackle the sim-to-real gap, where the discrepancy between the simulated and deployed environments can significantly undermine the performance of the learned policy. To endow the learned policy with robustness in a sample-efficient manner in the presence of high-dimensional state-action space, this paper considers the sample complexity of distributionally robust linear Markov decision processes (MDPs) with an uncertainty set characterized by the total variation distance using offline data. We develop a pessimistic model-based algorithm and establish its sample complexity bound under minimal data coverage assumptions, which outperforms prior art by at least $\widetilde{O}(d)$, where $d$ is the feature dimension. We further improve the performance guarantee of the proposed algorithm by incorporating a carefully-designed variance estimator.
Machine Learning, Statistics Theory
What problem does this paper attempt to address?
The paper addresses the sample complexity of offline distributionally robust linear Markov decision processes (Lin-RMDPs): how to learn, in a sample-efficient manner, policies that stay robust to environment shifts when the state-action space is high-dimensional. The specific issues are:

1. **Limitations of offline reinforcement learning**: Offline RL has no active exploration, so model robustness becomes critical. When the simulated environment differs from the deployment environment (the sim-to-real gap), the discrepancy can significantly degrade the performance of the learned policy.
2. **The need for distributional robustness**: To keep the learned policy effective under environmental uncertainty, distributionally robust offline RL algorithms optimize worst-case performance, that is, performance under the least favorable environment within a predefined uncertainty set. In this paper, the uncertainty set is characterized by the total variation distance.
3. **The challenge of sample complexity**: Most existing provable distributionally robust offline RL algorithms apply only to tabular settings with finite state and action spaces. Their sample complexity scales linearly with the size of the state-action space, which is prohibitive in high-dimensional problems. A sample-efficient algorithm under linear function representation is therefore needed.
4. **Main research question**: Can a provably sample-efficient algorithm be designed for distributionally robust offline RL with linear representation?
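The worst-case objective described in item 2 can be illustrated on a small finite example. The sketch below is an assumption-laden illustration, not the paper's construction: it treats a single tabular next-state distribution `p0` (hypothetical name) and computes the smallest expected value over all distributions within total variation distance `sigma` of it, using the convention TV(P, Q) = 0.5 * ||P - Q||_1. The paper itself defines its uncertainty set for linear MDPs, which this toy version does not reproduce.

```python
import numpy as np

def worst_case_expectation_tv(p0, v, sigma):
    """Return (inf of E_P[v], minimizing P) over P with TV(P, p0) <= sigma.

    Illustrative tabular sketch; assumes p0 is a probability vector
    and TV(P, Q) = 0.5 * ||P - Q||_1.
    """
    p0 = np.asarray(p0, dtype=float)
    v = np.asarray(v, dtype=float)
    p = p0.copy()
    budget = min(sigma, 1.0)      # total mass the adversary may move
    lowest = int(np.argmin(v))    # outcome that receives the moved mass
    # Take mass from the highest-value outcomes first.
    for i in np.argsort(-v):
        if budget <= 0 or i == lowest:
            continue
        moved = min(p[i], budget)
        p[i] -= moved
        p[lowest] += moved
        budget -= moved
    return float(p @ v), p

if __name__ == "__main__":
    value, worst_p = worst_case_expectation_tv([0.5, 0.3, 0.2], [1.0, 0.0, 2.0], 0.1)
    print(value, worst_p)
```

In this example the nominal expectation is 0.9, and moving 0.1 of probability mass from the highest-value outcome to the lowest-value one lowers it to 0.7. A robust policy is evaluated against this lower figure rather than the nominal one.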
### Overview of the Solution

To answer this question, the paper proposes an algorithm called Distributionally Robust Pessimistic Least-Squares Value Iteration (DROP) and analyzes its performance theoretically. The specific contributions are as follows:

- **Algorithm design**: DROP is a distributionally robust pessimistic least-squares value iteration algorithm built on linear representation. It introduces a data-driven penalty function to handle data scarcity in the offline setting.
- **Theoretical guarantees**: The paper establishes a sub-optimality bound under minimal offline data coverage assumptions and introduces a new concentrability coefficient \( C^{\star}_{\text{rob}} \) to characterize the partial feature coverage of the offline data.
- **Improved sample complexity**: Compared with existing methods, DROP improves the sample complexity by at least \( \tilde{O}(d) \) under partial feature coverage, where \( d \) is the feature dimension.
- **Variance-weighted variant**: A variance-weighted variant, DROP-V, further tightens the sub-optimality gap under the full feature coverage assumption by controlling the variance more precisely.

In summary, by proposing DROP and its variant, the paper addresses the sample complexity of distributionally robust offline RL with high-dimensional state-action spaces and provides theoretical performance guarantees.
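The role of a data-driven penalty in pessimistic least-squares value iteration can be sketched with a generic ridge-regression example. This is a minimal illustration under stated assumptions, not the DROP algorithm: the function name, the elliptical penalty `beta * sqrt(phi^T Lambda^{-1} phi)`, and the parameters `beta` and `lam` are the standard form used in pessimistic linear value iteration and are assumed here, since the summary does not give the paper's exact penalty.

```python
import numpy as np

def pessimistic_ridge_estimate(features, targets, query, beta=1.0, lam=1.0):
    """Ridge-regression value estimate at `query` minus an uncertainty penalty.

    Hypothetical helper for illustration; the penalty form is an assumption,
    not taken from the paper.
    """
    features = np.asarray(features, dtype=float)   # shape (N, d)
    targets = np.asarray(targets, dtype=float)     # shape (N,)
    query = np.asarray(query, dtype=float)         # shape (d,)
    d = features.shape[1]
    # Regularized Gram matrix of the offline features.
    gram = features.T @ features + lam * np.eye(d)
    # Least-squares weights fitted to the regression targets.
    theta = np.linalg.solve(gram, features.T @ targets)
    # The penalty grows in feature directions the dataset covers poorly.
    penalty = beta * np.sqrt(query @ np.linalg.solve(gram, query))
    return float(query @ theta - penalty), float(penalty)

if __name__ == "__main__":
    data = np.tile([1.0, 0.0], (9, 1))   # dataset only covers the first direction
    rewards = np.ones(9)
    print(pessimistic_ridge_estimate(data, rewards, [1.0, 0.0]))
    print(pessimistic_ridge_estimate(data, rewards, [0.0, 1.0]))
```

With nine samples along the first feature direction and none along the second, the penalty for the second direction is larger, so the estimate there is pushed down. This is the sense in which pessimism discourages a policy from relying on state-action pairs the offline data does not cover.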