Provably Efficient CVaR RL in Low-rank MDPs

Yulai Zhao, Wenhao Zhan, Xiaoyan Hu, Ho-fung Leung, Farzan Farnia, Wen Sun, Jason D. Lee
2023-11-21
Abstract: We study risk-sensitive Reinforcement Learning (RL), where we aim to maximize the Conditional Value at Risk (CVaR) with a fixed risk tolerance $\tau$. Prior theoretical work studying risk-sensitive RL focuses on the tabular Markov Decision Processes (MDPs) setting. To extend CVaR RL to settings where state space is large, function approximation must be deployed. We study CVaR RL in low-rank MDPs with nonlinear function approximation. Low-rank MDPs assume the underlying transition kernel admits a low-rank decomposition, but unlike prior linear models, low-rank MDPs do not assume the feature or state-action representation is known. We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to carefully balance the interplay between exploration, exploitation, and representation learning in CVaR RL. We prove that our algorithm achieves a sample complexity of $\tilde{O}\left(\frac{H^7 A^2 d^4}{\tau^2 \epsilon^2}\right)$ to yield an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of action space, and $d$ is the dimension of representations. Computational-wise, we design a novel discretized Least-Squares Value Iteration (LSVI) algorithm for the CVaR objective as the planning oracle and show that we can find the near-optimal policy in a polynomial running time with a Maximum Likelihood Estimation oracle. To our knowledge, this is the first provably efficient CVaR RL algorithm in low-rank MDPs.
Machine Learning
What problem does this paper attempt to address?
This paper studies how to optimize conditional value-at-risk (CVaR), a risk-sensitive reinforcement learning (RL) objective, in low-rank Markov decision processes (MDPs). Standard RL maximizes the expected cumulative reward, which in high-stakes domains such as autonomous driving, finance, and healthcare can overlook rare but catastrophic outcomes. CVaR instead measures the expected return over the worst $\tau$-fraction of outcomes, where $\tau$ is a fixed risk tolerance, so optimizing it guards against those tail events.

The paper proposes an algorithm called ELA (Representation Learning for CVaR) that optimizes the CVaR objective in low-rank MDPs. ELA uses maximum likelihood estimation (MLE) to learn the transition model, and adds an upper confidence bound (UCB) bonus to the reward to balance exploration, exploitation, and representation learning. The analysis shows that ELA returns an $\epsilon$-optimal CVaR with sample complexity $\tilde{O}\left(\frac{H^7 A^2 d^4}{\tau^2 \epsilon^2}\right)$, where $H$ is the episode length, $A$ is the size of the action space, and $d$ is the representation dimension. According to the authors, this makes it the first provably sample-efficient CVaR RL algorithm for low-rank MDPs.

To improve computational efficiency, the paper also designs a planning oracle based on least-squares value iteration (LSVI), called ELLA (Representation Learning with LSVI for CVaR). It uses a discretized reward function to find a near-optimal policy in the learned model. Given an MLE oracle, it runs in polynomial time, with a cost that depends on the dimension of the representation rather than the size of the state space.

The main contributions of the paper are:
1. The ELA algorithm, which achieves sample efficiency for CVaR RL in low-rank MDPs for the first time.
2. The ELLA algorithm, which achieves efficient planning through LSVI with polynomial time complexity.
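To make the CVaR objective above concrete, here is a minimal Python sketch that estimates CVaR from a batch of sampled returns. It is not taken from the paper. The function names `empirical_cvar` and `cvar_variational` and the sample data are hypothetical, and the sketch only illustrates the standard lower-tail definition, $\mathrm{CVaR}_\tau(X) = \sup_b \{ b - \tau^{-1}\,\mathbb{E}[(b - X)^+] \}$, which for an empirical sample coincides with the mean of the worst $\tau$-fraction of returns when $\tau n$ is an integer.

```python
import numpy as np


def empirical_cvar(returns, tau):
    """Mean of the worst tau-fraction of sampled returns (lower tail)."""
    x = np.sort(np.asarray(returns, dtype=float))
    # Number of samples in the tail; always keep at least one.
    k = max(1, int(np.ceil(tau * len(x))))
    return x[:k].mean()


def cvar_variational(returns, tau):
    """CVaR via sup over b of  b - E[(b - X)^+] / tau.

    For an empirical distribution the objective is piecewise linear and
    concave in b, so it suffices to search over the sample values.
    """
    x = np.asarray(returns, dtype=float)
    candidates = np.unique(x)
    values = [b - np.maximum(b - x, 0.0).mean() / tau for b in candidates]
    return max(values)


# Hypothetical episode returns; the two worst outcomes dominate CVaR at tau = 0.25.
sample_returns = [10, -5, 3, 8, -20, 6, 7, 9]
print("mean return:", np.mean(sample_returns))
print("CVaR (tail average):", empirical_cvar(sample_returns, 0.25))
print("CVaR (variational):", cvar_variational(sample_returns, 0.25))
```

On this sample the mean return is positive while the CVaR at $\tau = 0.25$ is strongly negative, which is the gap between risk-neutral and risk-sensitive objectives that the summary describes. Setting $\tau = 1$ recovers the ordinary mean.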
The paper's related work section surveys prior research on low-rank MDPs and CVaR RL, pointing out that existing methods are either limited to tabular MDPs or inefficient when the state space is large. The new algorithms address these limitations by combining risk aversion, exploration, and representation learning in unknown environments.
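The low-rank structure referred to here and in the abstract can be illustrated with a small numerical sketch. The sizes `S`, `A`, `d` and the random features below are hypothetical, and the construction assumes a special case in which the features lie on the probability simplex (a latent-variable model). The general low-rank MDP definition does not require this, and in the paper's setting the features are unknown and must be learned. The sketch builds a transition matrix $P(s' \mid s, a) = \langle \phi(s, a), \mu(s') \rangle$ and checks that it is a valid kernel of rank at most $d$.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d = 6, 3, 2  # hypothetical numbers of states, actions, and latent dimension

# phi(s, a): one d-dimensional feature per state-action pair,
# drawn on the probability simplex (assumption for this sketch).
phi = rng.dirichlet(np.ones(d), size=S * A)  # shape (S*A, d)

# mu: each of the d rows is a distribution over next states.
mu = rng.dirichlet(np.ones(S), size=d)  # shape (d, S)

# Transition matrix: row (s, a), column s', entry <phi(s, a), mu(s')>.
P = phi @ mu  # shape (S*A, S)

print("rows sum to one:", np.allclose(P.sum(axis=1), 1.0))
print("rank of P:", np.linalg.matrix_rank(P), "<= d =", d)
```

The point of the factorization is that quantities which depend on the next state can be expressed through the $d$-dimensional features, which is why planning cost can scale with $d$ instead of the number of states.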