Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis

Jia Lin Hau, Erick Delage, Esther Derman, Mohammad Ghavamzadeh, Marek Petrik
2024-11-01
Abstract: In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are a standard metric for modeling RL agents' preferences for certain outcomes. This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees. The algorithm leverages a new, simple dynamic program (DP) decomposition for quantile MDPs. Compared with prior work, our DP decomposition requires neither known transition probabilities nor solving complex saddle point equations and serves as a suitable foundation for other model-free RL algorithms. Our numerical results in tabular domains show that our Q-learning algorithm converges to its DP variant and outperforms earlier algorithms.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to optimize quantile-based risk measures, such as Value-at-Risk (VaR), in Markov decision processes (MDPs). The authors propose a new Q-learning algorithm that optimizes quantile MDPs with strong convergence and performance guarantees. Unlike previous methods, the approach requires neither known transition probabilities nor the solution of complex saddle-point equations, and it can serve as a foundation for other model-free reinforcement learning (RL) algorithms.

### Main contributions of the paper:

1. **New dynamic programming decomposition**: A new, simple dynamic programming (DP) decomposition for quantile MDPs is proposed. It requires neither known transition probabilities nor solving complex saddle-point equations.
2. **VaR-Q-learning algorithm**: VaR-Q-learning is introduced as the first model-free method that provably optimizes the VaR of the return distribution, and its convergence is established.
3. **Theoretical analysis**: A rigorous convergence proof is provided to support the algorithm's guarantees.
4. **Numerical experiments**: The algorithm is evaluated in multiple tabular domains, where it shows robustness across environments and risk levels.

### Specific methods for solving problems:

- **Dynamic programming (DP) decomposition**: The risk level is made part of the state, and the optimal policy is found by Bellman recursion over this augmented state space.
- **Q-learning algorithm**: The traditional quantile loss function is replaced with a soft-quantile loss function so that the Q-learning algorithm converges to a unique solution.
- **Discretization scheme**: To handle the continuous risk level α, a uniform discretization scheme maps the risk level onto a finite number of values, which makes the computation feasible.
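The discretization scheme and the risk-augmented state can be illustrated with a minimal Python sketch. The names `uniform_risk_grid`, `snap_to_grid`, `num_states`, and `num_actions` are hypothetical, and rounding a risk level down to the nearest grid point is an assumption of this sketch rather than a detail taken from the paper.

```python
import numpy as np


def uniform_risk_grid(num_levels):
    """Return num_levels + 1 evenly spaced risk levels covering [0, 1]."""
    return np.linspace(0.0, 1.0, num_levels + 1)


def snap_to_grid(alpha, grid):
    """Index of the largest grid point <= alpha.

    Rounding down is an assumption made for illustration only.
    """
    idx = np.searchsorted(grid, alpha, side="right") - 1
    return int(np.clip(idx, 0, len(grid) - 1))


# Hypothetical problem sizes, chosen only for the example.
num_states, num_actions = 3, 2
grid = uniform_risk_grid(4)  # risk levels 0, 0.25, 0.5, 0.75, 1

# A tabular q-function indexed by (state, risk-level index, action):
# the risk level acts as an extra coordinate of the state.
q_table = np.zeros((num_states, len(grid), num_actions))

j = snap_to_grid(0.6, grid)  # grid[j] is the level used in place of 0.6
```

With a finer grid the table grows linearly in the number of risk levels, which is the usual trade-off between approximation quality and memory in a tabular setting.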
### Formula summary:

- Quantile definition:
  \[ q^-_{\alpha}(\tilde{x}) := \min \{ \tau \in \bar{\mathbb{R}} \mid P[\tilde{x} \leq \tau] \geq \alpha \} \]
  \[ q^+_{\alpha}(\tilde{x}) := \max \{ \tau \in \bar{\mathbb{R}} \mid P[\tilde{x} < \tau] \leq \alpha \} \]
- VaR definition:
  \[ \text{VaR}_{\alpha}[\tilde{x}] := q^+_{\alpha}(\tilde{x}) \]
- Bellman operator:
  \[ (B_{\text{max}}q)(s, \alpha, a) := r(s, a) + \gamma \cdot \max_{o \in O_{sa}(\alpha)} \min_{s' \in S} \max_{a' \in A} q(s', o_{s'}, a') \]
- Soft-quantile loss function:
  \[ \ell^{\kappa}_{\alpha}(\delta) = \begin{cases} (1-\alpha) \frac{(1 - \delta/\kappa)^2}{2} & \text{if } \delta < -\kappa \\ (1-\alpha) \frac{\delta^2}{2\kappa} & \text{if } \delta \in [-\kappa, 0) \\ \alpha \frac{\delta^2}{2\kappa} & \text{if } \delta \in [0, \kappa) \\ \alpha \frac{(1 + \delta/\kappa)^2}{2} & \text{if } \delta \geq \kappa \end{cases} \]
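To make the lower and upper quantile definitions concrete, here is a minimal Python sketch for a random variable with finitely many outcomes. The function names and the `(values, probs)` representation are assumptions made for illustration; the sketch only evaluates the definitions and is not the paper's algorithm.

```python
import math

import numpy as np


def _sorted_cdf(values, probs):
    """Sort outcomes and return them with their cumulative probabilities."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    cdf = np.cumsum(np.asarray(probs, dtype=float)[order])
    return v, cdf


def lower_quantile(values, probs, alpha):
    """q^-_alpha: smallest tau with P[x <= tau] >= alpha."""
    if alpha <= 0.0:
        return -math.inf  # on the extended reals, P[x <= -inf] = 0 >= alpha
    v, cdf = _sorted_cdf(values, probs)
    idx = np.searchsorted(cdf, alpha, side="left")  # first cdf[i] >= alpha
    return float(v[min(idx, len(v) - 1)])


def upper_quantile(values, probs, alpha):
    """q^+_alpha: largest tau with P[x < tau] <= alpha."""
    if alpha >= 1.0:
        return math.inf  # P[x < tau] <= 1 holds for every tau
    v, cdf = _sorted_cdf(values, probs)
    idx = np.searchsorted(cdf, alpha, side="right")  # first cdf[i] > alpha
    return float(v[idx]) if idx < len(v) else math.inf


def value_at_risk(values, probs, alpha):
    """VaR_alpha is defined as the upper quantile q^+_alpha."""
    return upper_quantile(values, probs, alpha)
```

For a fair coin with outcomes 0 and 1, the two quantiles differ at α = 0.5: `lower_quantile([0, 1], [0.5, 0.5], 0.5)` gives 0.0 while `value_at_risk([0, 1], [0.5, 0.5], 0.5)` gives 1.0, which shows why the choice of q^+ in the VaR definition matters at the jump points of the distribution.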