Abstract: In this paper, we study a novel episodic risk-sensitive Reinforcement Learning (RL) problem, named Iterated CVaR RL, which aims to maximize the tail of the reward-to-go at each step, and focuses on tightly controlling the risk of getting into catastrophic situations at each stage. This formulation is applicable to real-world tasks that demand strong risk avoidance throughout the decision process, such as autonomous driving, clinical treatment planning and robotics. We investigate two performance metrics under Iterated CVaR RL, i.e., Regret Minimization and Best Policy Identification. For both metrics, we design efficient algorithms ICVaR-RM and ICVaR-BPI, respectively, and provide nearly matching upper and lower bounds with respect to the number of episodes $K$. We also investigate an interesting limiting case of Iterated CVaR RL, called Worst Path RL, where the objective becomes to maximize the minimum possible cumulative reward. For Worst Path RL, we propose an efficient algorithm with constant upper and lower bounds. Finally, our techniques for bounding the change of CVaR due to the value function shift and decomposing the regret via a distorted visitation distribution are novel, and can find applications in other risk-sensitive RL problems.

What problem does this paper attempt to address?

This paper addresses risk-sensitive decision-making in Reinforcement Learning (RL), particularly in settings where risk must be strictly controlled. Specifically, it proposes a new Iterated Conditional Value-at-Risk (Iterated CVaR) RL model that maximizes the tail of the reward-to-go at each step and focuses on avoiding catastrophic states.

### Main Problem Description

1. **Maximizing Tail Rewards**: At each decision stage, maximize the CVaR of the reward-to-go at risk level α, i.e., the expected value over the worst α-fraction of outcomes.
2. **Avoiding Catastrophic States**: Enforce strict risk control throughout the decision process, especially in applications such as autonomous driving, clinical treatment planning, and robotics, where the tolerance for risk is extremely low.

### Model Features

- **Dynamic Multi-stage Risk Measurement**: Unlike the traditional single-stage CVaR, the Iterated CVaR is defined by backward iteration and focuses on the worst portion of the reward-to-go at each step.
- **Wide Application**: Applicable to tasks that require strong risk aversion, such as autonomous driving, clinical treatment planning, and robot control.

### Research Content

1. **Performance Metrics**:
   - **Regret Minimization (RM)**: The goal is to minimize the cumulative regret over all episodes.
   - **Best Policy Identification (BPI)**: Measures the number of episodes required to identify the optimal policy.
2. **Algorithm Design**:
   - Two efficient algorithms, ICVaR-RM and ICVaR-BPI, are proposed for the RM and BPI metrics, respectively.
   - Novel techniques are developed to bound the change in CVaR caused by a shift in the value function and to decompose the regret via a distorted visitation distribution.
3. **Limiting-case Study**:
   - When the risk level α approaches 0, the problem becomes Worst Path RL, whose goal is to maximize the minimum possible cumulative reward.
   - A simple and efficient algorithm, MaxWP, is developed for Worst Path RL, with constant regret upper and lower bounds that are independent of $K$.

### Main Contributions

1. **New Model Proposal**: A new Iterated CVaR RL model is proposed, which controls risk at every step of the decision process rather than only in aggregate.
2. **Performance Analysis**: Efficient algorithms are proposed for the RM and BPI metrics, and nearly matching upper and lower bounds on regret and sample complexity are established.
3. **Limiting-case Handling**: The limiting case of Worst Path RL (α approaching 0) is studied, and the corresponding algorithm and theoretical results are provided.

### Formula Examples

- **CVaR Definition**:
  \[
  \text{CVaR}_\alpha(X) = \sup_{x \in \mathbb{R}} \left\{ x - \frac{1}{\alpha} \mathbb{E}\left[(x - X)^+\right] \right\}
  \]
  where $(x - X)^+ = \max(x - X, 0)$.
- **Bellman Equation**:
  \[
  \begin{cases}
  Q^\pi_h(s, a) = r(s, a) + \text{CVaR}_\alpha[V^\pi_{h + 1}(s')] \\
  V^\pi_h(s) = Q^\pi_h(s, \pi_h(s)) \\
  V^\pi_{H + 1}(s) = 0, \quad \forall s \in S
  \end{cases}
  \]
- **Worst Path RL Bellman Equation**:
  \[
  \begin{cases}
  Q^\pi_h(s, a) = r(s, a) + \min_{s' \sim p(\cdot|s,a)} V^\pi_{h + 1}(s') \\
  V^\pi_h(s) = Q^\pi_h(s, \pi_h(s)) \\
  V^\pi_{H + 1}(s) = 0, \quad \forall s \in S
  \end{cases}
  \]
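The CVaR and Worst Path backups described in this summary can be illustrated with a small planning sketch. This is not the paper's ICVaR-RM, ICVaR-BPI, or MaxWP algorithm, which learn from episodes without knowing the transitions. It assumes the reward table `r` and transition kernel `P` are known, and it takes a maximum over actions to get optimal values (the summary states the backup only for a fixed policy). The two-state toy MDP and all function names (`cvar`, `worst_path`, `backward_induction`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def cvar(values, probs, alpha):
    """Mean of the worst (lowest) alpha-fraction of a discrete distribution."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    # Probability mass lying strictly below each sorted outcome.
    cum_before = np.concatenate(([0.0], np.cumsum(p)[:-1]))
    # Take mass from the lowest outcomes until a total of alpha is used.
    take = np.clip(alpha - cum_before, 0.0, p)
    return float(take @ v) / alpha

def worst_path(values, probs):
    """Limiting case alpha -> 0: minimum value over reachable next states."""
    values = np.asarray(values, dtype=float)
    return float(values[np.asarray(probs) > 0].min())

def backward_induction(r, P, H, backup):
    """Backward iteration with a pluggable risk backup.

    r: (S, A) rewards, P: (S, A, S) transition probabilities (assumed known).
    Returns V of shape (H + 2, S) and Q of shape (H + 1, S, A); index 0 is unused.
    """
    S, A = r.shape
    V = np.zeros((H + 2, S))  # V[H + 1] = 0 is the terminal condition
    Q = np.zeros((H + 1, S, A))
    for h in range(H, 0, -1):
        for s in range(S):
            for a in range(A):
                Q[h, s, a] = r[s, a] + backup(V[h + 1], P[s, a])
        V[h] = Q[h].max(axis=1)  # greedy over actions (assumption: optimal values)
    return V, Q

# Hypothetical toy MDP: state 0 = "ok", state 1 = "crashed" (absorbing, no reward).
# Action 0 = "safe" (reward 0.5, never crashes); action 1 = "risky" (reward 1.0,
# crashes with probability 0.1).
r = np.array([[0.5, 1.0],
              [0.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.9, 0.1]],
              [[0.0, 1.0], [0.0, 1.0]]])
H = 3

V_cvar, Q_cvar = backward_induction(r, P, H, lambda v, p: cvar(v, p, alpha=0.1))
V_mean, Q_mean = backward_induction(r, P, H, lambda v, p: cvar(v, p, alpha=1.0))
V_wp, Q_wp = backward_induction(r, P, H, worst_path)

print("first-step action, alpha=0.1:", int(Q_cvar[1, 0].argmax()))
print("first-step action, alpha=1.0:", int(Q_mean[1, 0].argmax()))
print("first-step action, worst path:", int(Q_wp[1, 0].argmax()))
```

In this toy example, the risk-neutral backup (α = 1, where CVaR reduces to the expectation) prefers the risky action at the first step, while the α = 0.1 backup and the Worst Path backup both prefer the safe action, because the 10% crash outcome dominates the tail at every step.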

Provably Efficient Risk-Sensitive Reinforcement Learning: Iterated CVaR and Worst Path

Risk-Sensitive Reinforcement Learning: Iterated CVaR and the Worst Path.

Provably Efficient Iterated CVaR Reinforcement Learning with Function Approximation and Human Feedback

Provably Efficient CVaR RL in Low-rank MDPs

Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR

Towards Safe Reinforcement Learning Via Constraining Conditional Value-at-Risk

Risk-Averse Reinforcement Learning via Dynamic Time-Consistent Risk Measures

Robust Risk-Sensitive Reinforcement Learning with Conditional Value-at-Risk

CVaR-Constrained Policy Optimization for Safe Reinforcement Learning

Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds

Extreme Risk Mitigation in Reinforcement Learning using Extreme Value Theory

Efficient Off-Policy Safe Reinforcement Learning Using Trust Region Conditional Value at Risk

Exponential Bellman Equation and Improved Regret Bounds for Risk-Sensitive Reinforcement Learning

Efficient Risk-Averse Reinforcement Learning

A Convex Programming Approach to Data-Driven Risk-Averse Reinforcement Learning

Fundamental Limits of Reinforcement Learning in Environment with Endogeneous and Exogeneous Uncertainty

Continuous-time Risk-sensitive Reinforcement Learning via Quadratic Variation Penalty

Cautious Adaptation For Reinforcement Learning in Safety-Critical Settings

Uncertainty-Aware Reinforcement Learning for Portfolio Optimization

Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach

Risk-Sensitive Reinforcement Learning with Exponential Criteria