Abstract: We study risk-sensitive reinforcement learning (RL), a crucial field due to its ability to enhance decision-making in scenarios where it is essential to manage uncertainty and minimize potential adverse outcomes. Particularly, our work focuses on applying the entropic risk measure to RL problems. While existing literature primarily investigates the online setting, there remains a large gap in understanding how to efficiently derive a near-optimal policy based on this risk measure using only a pre-collected dataset. We center on the linear Markov Decision Process (MDP) setting, a well-regarded theoretical framework that has yet to be examined from a risk-sensitive standpoint. In response, we introduce two provably sample-efficient algorithms. We begin by presenting a risk-sensitive pessimistic value iteration algorithm, offering a tight analysis by leveraging the structure of the risk-sensitive performance measure. To further improve the obtained bounds, we propose another pessimistic algorithm that utilizes variance information and reference-advantage decomposition, effectively improving both the dependence on the space dimension $d$ and the risk-sensitivity factor. To the best of our knowledge, we obtain the first provably efficient risk-sensitive offline RL algorithms.

What problem does this paper attempt to address?

The paper aims to design efficient risk-sensitive algorithms for offline reinforcement learning (offline RL), particularly under linear function approximation. Specifically, it asks how to use a pre-collected dataset to learn a near-optimal policy while managing uncertainty and minimizing potential adverse outcomes. It does so by applying the entropic risk measure to the reinforcement learning problem. The paper addresses the gap between online-setting and offline-setting research: under the linear Markov Decision Process (linear MDP) framework, how to efficiently derive a near-optimal policy from a pre-collected dataset has not been fully explored.

### Key points of the paper:

1. **Problem background**:
   - Risk-sensitive reinforcement learning is increasingly important in scenarios that require managing uncertainty and reducing potential adverse consequences, such as finance, optimal control, neuroscience, and psychology.
   - Most existing work focuses on the online setting, in which the learner can interact with the environment and explore. At the theoretical level, little is known about how to learn such policies with provable efficiency in the offline setting, where the learner has a pre-collected dataset but cannot interact with the environment.
2. **Research objectives**:
   - Design a risk-sensitive reinforcement learning algorithm that is provably efficient on offline datasets.
   - Pay special attention to optimizing the entropic risk measure in the offline setting. This involves the risk measure $V_\beta := \frac{1}{\beta} \log \left( \mathbb{E} \left[ e^{\beta R} \right] \right)$ defined in the Markov decision process, where $R$ is the return and $\beta$ is an adjustable parameter that controls risk sensitivity.
3. **Main contributions**:
   - Propose two pessimistic value iteration algorithms that are sample-efficient for linear MDPs.
   - The first is a pessimistic value iteration algorithm that eliminates spurious correlations by taking advantage of the structure of the entropic risk measure.
   - The second additionally uses variance information and reference-advantage decomposition to improve the dependence on the feature space dimension $d$ and the risk-sensitivity factor, thereby providing tighter theoretical guarantees.
4. **Technical challenges**:
   - How to achieve pessimism in risk-sensitive offline RL, specifically for the entropic risk measure.
   - How to incorporate variance estimation into the algorithm to tighten the theoretical guarantees.

By solving the above problems, the paper provides new insights and methods for risk-sensitive offline reinforcement learning. These results have theoretical and practical significance, especially in highly risk-sensitive applications such as finance.
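To make the entropic risk measure $V_\beta$ from the research objectives concrete, the following is a minimal Python sketch that estimates it from a sample of returns. It is an illustration only and not the paper's algorithm: the function name `entropic_risk`, the sample returns, and the use of an empirical mean in place of the expectation are all assumptions made for this example. Under the convention that $R$ is a reward-style return, negative $\beta$ gives a risk-averse value, positive $\beta$ a risk-seeking one, and $\beta \to 0$ recovers the ordinary mean.

```python
import numpy as np


def entropic_risk(returns, beta):
    """Empirical entropic risk (1/beta) * log(mean(exp(beta * R))).

    Hypothetical helper for illustration; the expectation is replaced
    by a sample mean over the given returns.
    """
    returns = np.asarray(returns, dtype=float)
    if beta == 0.0:
        # The beta -> 0 limit of the entropic risk is the plain mean.
        return float(returns.mean())
    scaled = beta * returns
    # Log-mean-exp trick: subtract the maximum before exponentiating
    # so that large |beta * R| values do not overflow.
    shift = scaled.max()
    log_mean_exp = shift + np.log(np.mean(np.exp(scaled - shift)))
    return float(log_mean_exp / beta)


# Made-up returns: mostly small gains with one large loss.
returns = [1.0, 1.0, 1.0, -5.0]

risk_averse = entropic_risk(returns, beta=-1.0)   # weights the loss heavily
risk_neutral = entropic_risk(returns, beta=0.0)   # ordinary mean
risk_seeking = entropic_risk(returns, beta=1.0)   # weights the gains heavily
```

On this sample the risk-averse value falls below the mean and the risk-seeking value rises above it, which is the sense in which $\beta$ controls risk sensitivity.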

Pessimism Meets Risk: Risk-Sensitive Offline Reinforcement Learning

Is Pessimism Provably Efficient for Offline Reinforcement Learning?

One Risk to Rule Them All: A Risk-Sensitive Perspective on Model-Based Offline Reinforcement Learning

Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage

Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes

Robust Offline Reinforcement Learning for Non-Markovian Decision Processes

State-Aware Proximal Pessimistic Algorithms for Offline Reinforcement Learning

Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism

Efficient Risk-Averse Reinforcement Learning

Neural Network Approximation for Pessimistic Offline Reinforcement Learning

Survival Instinct in Offline Reinforcement Learning

Uncertainty-aware Distributional Offline Reinforcement Learning

Diverse Randomized Value Functions: A Provably Pessimistic Approach for Offline Reinforcement Learning

Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning

Sparsity-based Safety Conservatism for Constrained Offline Reinforcement Learning

Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism

Risk-sensitive Markov Decision Process and Learning under General Utility Functions

De-Pessimism Offline Reinforcement Learning via Value Compensation

Online Policy Optimization for Robust MDP