Pessimism Meets Risk: Risk-Sensitive Offline Reinforcement Learning

Dake Zhang, Boxiang Lyu, Shuang Qiu, Mladen Kolar, Tong Zhang
2024-07-10
Abstract: We study risk-sensitive reinforcement learning (RL), a crucial field due to its ability to enhance decision-making in scenarios where it is essential to manage uncertainty and minimize potential adverse outcomes. Particularly, our work focuses on applying the entropic risk measure to RL problems. While existing literature primarily investigates the online setting, there remains a large gap in understanding how to efficiently derive a near-optimal policy based on this risk measure using only a pre-collected dataset. We center on the linear Markov Decision Process (MDP) setting, a well-regarded theoretical framework that has yet to be examined from a risk-sensitive standpoint. In response, we introduce two provably sample-efficient algorithms. We begin by presenting a risk-sensitive pessimistic value iteration algorithm, offering a tight analysis by leveraging the structure of the risk-sensitive performance measure. To further improve the obtained bounds, we propose another pessimistic algorithm that utilizes variance information and reference-advantage decomposition, effectively improving both the dependence on the space dimension $d$ and the risk-sensitivity factor. To the best of our knowledge, we obtain the first provably efficient risk-sensitive offline RL algorithms.
Machine Learning, Optimization and Control, Statistics Theory
What problem does this paper attempt to address?
The paper aims to design provably efficient risk-sensitive algorithms for offline reinforcement learning (offline RL) under linear function approximation. Specifically, it asks how to use a pre-collected dataset to learn a near-optimal policy while managing uncertainty and minimizing potential adverse outcomes, which it does by applying the entropic risk measure to the RL problem. The paper addresses a gap between online and offline research: under the linear Markov Decision Process (linear MDP) framework, how to derive a near-optimal risk-sensitive policy from a pre-collected dataset has not been fully explored.

### Key points of the paper:

1. **Problem background**:
   - Risk-sensitive reinforcement learning is increasingly important in scenarios that require managing uncertainty and reducing potential adverse consequences, such as finance, optimal control, neuroscience, and psychology.
   - Most existing work focuses on the online setting, where the learner can interact with the environment and explore. At the theoretical level, little is known about how to learn such policies with provable efficiency in the offline setting, where the learner has a pre-collected dataset but cannot interact with the environment.
2. **Research objectives**:
   - Design a risk-sensitive reinforcement learning algorithm that is provably efficient on offline datasets.
   - Optimize the entropic risk measure in the offline setting. For an MDP, this measure is \( V_\beta := \frac{1}{\beta} \log \left( \mathbb{E} \left[ e^{\beta R} \right] \right) \), where \(R\) denotes the cumulative reward and \(\beta\) is an adjustable parameter that controls risk sensitivity.
3. **Main contributions**:
   - Propose two pessimistic value iteration algorithms that are sample-efficient for linear MDPs.
   - The first algorithm is a pessimistic value iteration algorithm that eliminates spurious correlations by exploiting the structure of the entropic risk measure.
   - The second algorithm additionally uses variance information and reference-advantage decomposition to improve the dependence on the feature-space dimension \(d\) and on the risk-sensitivity factor, yielding tighter theoretical guarantees.
4. **Technical challenges**:
   - How to implement pessimism in risk-sensitive offline RL, specifically for the entropic risk measure.
   - How to incorporate variance estimation into the algorithm to tighten the theoretical guarantees.

By addressing these challenges, the paper provides new insights and methods for risk-sensitive offline reinforcement learning. The results are of theoretical and practical interest for highly risk-sensitive applications such as finance.
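To make the entropic risk measure from the research objectives concrete, the following is a minimal sketch that estimates \( V_\beta \) from a sample of returns. It is an illustration only, not the paper's algorithm: the function name `entropic_risk` and the two return samples are hypothetical, and the expectation is replaced by an empirical mean. The example shows a general property of the formula: for \(\beta < 0\) the measure falls below the mean when returns vary (risk-averse), for \(\beta > 0\) it rises above the mean, and as \(\beta \to 0\) it approaches the mean.

```python
import math


def entropic_risk(returns, beta):
    """Empirical entropic risk: (1 / beta) * log(mean(exp(beta * R))).

    `returns` is a hypothetical sample of cumulative rewards R.
    """
    if beta == 0:
        raise ValueError("beta must be nonzero; the limit beta -> 0 is the mean")
    scaled = [beta * r for r in returns]
    # Log-sum-exp shift: subtract the maximum before exponentiating
    # so that large |beta * R| values do not overflow.
    shift = max(scaled)
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return (shift + math.log(mean_exp)) / beta


# Two illustrative return samples with the same mean (1.0) but different spread.
safe = [1.0, 1.0, 1.0, 1.0]
risky = [0.0, 0.0, 2.0, 2.0]

for beta in (-1.0, 1e-6, 1.0):
    print(beta, entropic_risk(safe, beta), entropic_risk(risky, beta))
```

With a negative \(\beta\), the risky sample scores lower than the safe one even though both have the same mean, which is the sense in which \(\beta\) controls risk sensitivity.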