Abstract: We study risk-sensitive Reinforcement Learning (RL), where we aim to maximize the Conditional Value at Risk (CVaR) with a fixed risk tolerance $\tau$. Prior theoretical work on risk-sensitive RL focuses on the tabular Markov Decision Process (MDP) setting. To extend CVaR RL to settings where the state space is large, function approximation must be deployed. We study CVaR RL in low-rank MDPs with nonlinear function approximation. Low-rank MDPs assume the underlying transition kernel admits a low-rank decomposition, but unlike prior linear models, low-rank MDPs do not assume the feature or state-action representation is known. We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to carefully balance the interplay between exploration, exploitation, and representation learning in CVaR RL. We prove that our algorithm achieves a sample complexity of $\tilde{O}\left(\frac{H^7 A^2 d^4}{\tau^2 \epsilon^2}\right)$ to yield an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the size of the action space, and $d$ is the dimension of the representations. Computationally, we design a novel discretized Least-Squares Value Iteration (LSVI) algorithm for the CVaR objective as the planning oracle and show that we can find a near-optimal policy in polynomial running time with a Maximum Likelihood Estimation oracle. To our knowledge, this is the first provably efficient CVaR RL algorithm in low-rank MDPs.
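
The low-rank assumption in the abstract can be illustrated with a toy example. The sketch below is not from the paper: the feature map `PHI`, the factor `MU`, and every number in them are hypothetical, chosen only so that $P(s' \mid s, a) = \langle \phi(s, a), \mu(s') \rangle$ is a valid distribution with rank $d = 2$. In the setting the abstract describes, neither factor is known to the learner.

```python
# Toy low-rank MDP with 3 states, 2 actions, and rank d = 2.
# PHI and MU are made-up illustration values, not taken from the paper.
# PHI[(s, a)] is a distribution over 2 latent factors.
PHI = {
    (0, 0): (1.0, 0.0),
    (0, 1): (0.0, 1.0),
    (1, 0): (0.5, 0.5),
    (1, 1): (0.2, 0.8),
    (2, 0): (0.9, 0.1),
    (2, 1): (0.3, 0.7),
}
# MU[s_next][k] is the probability of s_next under latent factor k,
# so each latent factor's column sums to 1 across next states.
MU = {
    0: (0.7, 0.1),
    1: (0.2, 0.3),
    2: (0.1, 0.6),
}


def transition_prob(s, a, s_next):
    """P(s_next | s, a) as the inner product <phi(s, a), mu(s_next)>."""
    return sum(p * m for p, m in zip(PHI[(s, a)], MU[s_next]))


def transition_row(s, a):
    """Full next-state distribution for the pair (s, a)."""
    return [transition_prob(s, a, s_next) for s_next in sorted(MU)]


if __name__ == "__main__":
    for (s, a) in sorted(PHI):
        print((s, a), transition_row(s, a))
```

The transition kernel over 6 state-action pairs and 3 next states is fully determined by the two small factors, which is why the quantities in the abstract scale with $d$ rather than with the number of states.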

What problem does this paper attempt to address?

This paper studies how to use conditional value-at-risk (CVaR) as a risk-sensitive reinforcement learning (RL) objective in low-rank Markov decision processes (MDPs). Traditional RL maximizes the expected cumulative reward, but in high-stakes settings such as autonomous driving, finance, and healthcare, this objective can overlook rare but catastrophic events. CVaR is a risk measure that quantifies the expected return over the worst $\tau$-fraction of outcomes, so optimizing it makes the learned policy account for such events.

The paper proposes an algorithm called ELA (Representation Learning for CVaR) that optimizes the CVaR criterion in low-rank MDPs. ELA uses maximum likelihood estimation (MLE) to learn the model dynamics and constructs an upper confidence bound (UCB) bonus to balance exploration, exploitation, and representation learning. The analysis shows that ELA yields an ε-optimal CVaR with a sample complexity that scales as 1/ε^2, making it the first provably sample-efficient CVaR RL algorithm in low-rank MDPs. To improve computational efficiency, the paper also designs a planning oracle based on least-squares value iteration (LSVI), called ELLA (Representation Learning with LSVI for CVaR). It uses a discretized reward function to find an approximately optimal policy in the learned model, and its computational cost depends only on the dimension of the representation, not on the size of the state space.

The main contributions of the paper include:
1. The design of the ELA algorithm, which achieves sample efficiency for CVaR RL in low-rank MDPs for the first time.
2. The proposal of the ELLA algorithm, which achieves efficient planning through LSVI with polynomial time complexity.

The related work section of the paper covers previous research on low-rank MDPs and CVaR RL, pointing out that existing methods are either limited to tabular MDPs or inefficient in large state spaces. The new algorithms address these limitations and combine risk aversion, exploration, and representation learning in unknown environments.
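
To make the CVaR objective concrete, here is a minimal sketch of how it could be estimated from sampled episode returns. It uses the standard variational form $\mathrm{CVaR}_\tau(R) = \max_c \{ c - \tau^{-1} \mathbb{E}[(c - R)^+] \}$; the function name `empirical_cvar` and the sample data are assumptions for illustration, and this is not the paper's ELA or ELLA algorithm.

```python
def empirical_cvar(returns, tau):
    """Estimate CVaR at risk tolerance tau from a list of sampled returns.

    Maximizes c - E[(c - R)^+] / tau over c. For an empirical distribution
    the objective is concave and piecewise linear in c, so it is enough to
    try each sampled return as a candidate c.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    n = len(returns)
    best = float("-inf")
    for c in returns:
        # Average shortfall of the samples below the threshold c.
        shortfall = sum(max(c - r, 0.0) for r in returns) / n
        best = max(best, c - shortfall / tau)
    return best


if __name__ == "__main__":
    sampled_returns = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data
    print(empirical_cvar(sampled_returns, 0.2))  # mean of the worst 20%
    print(empirical_cvar(sampled_returns, 1.0))  # plain expectation
```

With $\tau = 1$ the estimate reduces to the ordinary mean, which is the risk-neutral objective of traditional RL; smaller $\tau$ puts all the weight on the lower tail of the return distribution.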

Provably Efficient CVaR RL in Low-rank MDPs

Provably Efficient Risk-Sensitive Reinforcement Learning: Iterated CVaR and Worst Path

Provably Efficient Iterated CVaR Reinforcement Learning with Function Approximation and Human Feedback

Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR

Risk-Sensitive Reinforcement Learning: Iterated CVaR and the Worst Path

Robust Risk-Sensitive Reinforcement Learning with Conditional Value-at-Risk

CVaR-Constrained Policy Optimization for Safe Reinforcement Learning

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

Risk-Averse Reinforcement Learning via Dynamic Time-Consistent Risk Measures

Towards Safe Reinforcement Learning Via Constraining Conditional Value-at-Risk

Improved Sample Complexity for Reward-free Reinforcement Learning under Low-rank MDPs

Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach

Risk-Averse Bayes-Adaptive Reinforcement Learning

Risk‐Sensitive Markov Decision Processes with Long‐Run CVaR Criterion

On the Maximization of Long-Run Reward CVaR for Markov Decision Processes

Risk-sensitive Markov Decision Process and Learning under General Utility Functions

Extreme Risk Mitigation in Reinforcement Learning using Extreme Value Theory

Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation

Fundamental Limits of Reinforcement Learning in Environment with Endogeneous and Exogeneous Uncertainty

Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds

Provable Risk-Sensitive Distributional Reinforcement Learning with General Function Approximation