Abstract: In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are a standard metric for modeling RL agents' preferences for certain outcomes. This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees. The algorithm leverages a new, simple dynamic program (DP) decomposition for quantile MDPs. Compared with prior work, our DP decomposition requires neither known transition probabilities nor solving complex saddle point equations and serves as a suitable foundation for other model-free RL algorithms. Our numerical results in tabular domains show that our Q-learning algorithm converges to its DP variant and outperforms earlier algorithms.

What problem does this paper attempt to address?

The problem this paper addresses is how to optimize quantile-based risk measures, such as Value-at-Risk (VaR), in Markov decision processes (MDPs). Specifically, the authors propose a new Q-learning algorithm for quantile MDPs that comes with strong convergence and performance guarantees. Unlike previous methods, the underlying decomposition requires neither known transition probabilities nor the solution of complex saddle-point equations, and it can serve as a foundation for other model-free reinforcement learning (RL) algorithms.

### Main contributions of the paper:

1. **New dynamic programming decomposition**: A new, simple dynamic programming (DP) decomposition for quantile MDPs is proposed, which requires neither known transition probabilities nor the solution of complex saddle-point equations.
2. **VaR-Q-learning algorithm**: VaR-Q-learning is introduced as the first model-free method that provably optimizes the VaR of the return distribution, and its convergence is established.
3. **Theoretical analysis**: A rigorous convergence proof is provided to support the algorithm's guarantees.
4. **Numerical experiments**: The algorithm is evaluated in multiple tabular domains, where it converges to its DP variant and remains robust across environments and risk levels.

### Specific methods for solving problems:

- **Dynamic programming (DP) decomposition**: The risk level is added to the state space, and the optimal policy is found by Bellman recursion over this augmented state.
- **Q-learning algorithm**: The traditional quantile loss function is replaced with a soft-quantile loss function so that the Q-learning algorithm converges to a unique solution.
- **Discretization scheme**: To handle the continuous risk level α, a uniform discretization scheme maps the risk level onto a finite set of values, which makes the computation feasible.
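The uniform discretization of the risk level α can be illustrated with a minimal sketch. The function names `uniform_alpha_grid` and `round_down_to_grid`, the grid spacing of 1/J, and the choice to round α down to the nearest grid point are all assumptions made for illustration; the summary does not state how the paper rounds or how fine its grid is.

```python
import math


def uniform_alpha_grid(num_levels):
    """Return the uniform grid 0, 1/J, 2/J, ..., 1 with J = num_levels.

    Hypothetical helper: the spacing 1/J is an illustrative assumption.
    """
    if num_levels < 1:
        raise ValueError("num_levels must be at least 1")
    return [j / num_levels for j in range(num_levels + 1)]


def round_down_to_grid(alpha, num_levels):
    """Map a continuous risk level to the largest grid point <= alpha.

    Rounding down (rather than up or to nearest) is an assumption here.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Small tolerance guards against floating-point error at grid points.
    j = math.floor(alpha * num_levels + 1e-12)
    return j / num_levels


grid = uniform_alpha_grid(4)
snapped = round_down_to_grid(0.6, 4)
```

With this grid, a tabular q-function indexed by (state, risk level, action) needs only J + 1 risk-level entries per state-action pair, at the cost of an approximation error that shrinks as J grows.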
### Formula summary:

- Quantile definition:
\[ q^-_{\alpha}(\tilde{x}) := \min \{ \tau \in \bar{\mathbb{R}} : P[\tilde{x} \leq \tau] \geq \alpha \} \]
\[ q^+_{\alpha}(\tilde{x}) := \max \{ \tau \in \bar{\mathbb{R}} : P[\tilde{x} < \tau] \leq \alpha \} \]
- VaR definition:
\[ \text{VaR}_{\alpha}[\tilde{x}] := q^+_{\alpha}(\tilde{x}) \]
- Bellman operator:
\[ (B_{\text{max}}q)(s, \alpha, a) := r(s, a) + \gamma \cdot \max_{o \in O_{sa}(\alpha)} \min_{s' \in S} \max_{a' \in A} q(s', o_{s'}, a') \]
- Soft-quantile loss function:
\[ \ell^{\kappa}_{\alpha}(\delta) = \begin{cases} (1-\alpha) \frac{(1 - \delta/\kappa)^2}{2} & \text{if } \delta < -\kappa \\ (1-\alpha) \frac{\delta^2}{2\kappa} & \text{if } \delta \in [-\kappa, 0) \\ \alpha \frac{\delta^2}{2\kappa} & \text{if } \delta \in [0, \kappa) \\ \alpha \frac{(1 + \delta/\kappa)^2}{2} & \text{if } \delta \geq \kappa \end{cases} \]
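To make the two quantile definitions concrete, the following minimal sketch evaluates the lower and upper quantiles of a finite discrete random variable. It is only an illustration of the definitions, not the paper's algorithm; the function names, the `(values, probs)` representation, and the tolerance `tol` are assumptions introduced here.

```python
import math


def lower_quantile(values, probs, alpha, tol=1e-12):
    """Smallest tau with P[x <= tau] >= alpha (the lower quantile)."""
    if alpha <= 0.0:
        return -math.inf  # every tau in the extended reals qualifies
    cum = 0.0
    for v, p in sorted(zip(values, probs)):
        cum += p  # cum is now P[x <= v]
        if cum >= alpha - tol:
            return v
    return max(values)  # reached only through rounding error


def upper_quantile(values, probs, alpha, tol=1e-12):
    """Largest tau with P[x < tau] <= alpha (the upper quantile)."""
    cum = 0.0
    for v, p in sorted(zip(values, probs)):
        cum += p  # cum is P[x <= v], i.e. P[x < tau] for tau just above v
        if cum > alpha + tol:
            return v  # any tau beyond v would violate P[x < tau] <= alpha
    return math.inf  # alpha >= 1: no finite tau violates the constraint


def value_at_risk(values, probs, alpha):
    """VaR at level alpha, defined in the summary as the upper quantile."""
    return upper_quantile(values, probs, alpha)


# Example distribution: x takes values 0, 1, 2 with probabilities 1/4, 1/2, 1/4.
vals, prs = [0.0, 1.0, 2.0], [0.25, 0.5, 0.25]
q_minus = lower_quantile(vals, prs, 0.25)
q_plus = upper_quantile(vals, prs, 0.25)
```

In this example the two quantiles differ at α = 0.25 (0.0 versus 1.0) because the cumulative distribution equals α exactly at the first support point, while at α = 0.5 both definitions return 1.0.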

Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis
