Provably Efficient Iterated CVaR Reinforcement Learning with Function Approximation and Human Feedback

Yu Chen, Yihan Du, Pihe Hu, Siwei Wang, Desheng Wu, Longbo Huang
2023-12-04
Abstract: Risk-sensitive reinforcement learning (RL) aims to optimize policies that balance the expected reward and risk. In this paper, we present a novel risk-sensitive RL framework that employs an Iterated Conditional Value-at-Risk (CVaR) objective under both linear and general function approximations, enriched by human feedback. These new formulations provide a principled way to guarantee safety in each decision-making step throughout the control process. Moreover, integrating human feedback into the risk-sensitive RL framework bridges the gap between algorithmic decision-making and human participation, allowing us to also guarantee safety for human-in-the-loop systems. We propose provably sample-efficient algorithms for this Iterated CVaR RL and provide rigorous theoretical analysis. Furthermore, we establish a matching lower bound to corroborate the optimality of our algorithms in a linear context.
Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper studies how to handle risk-sensitive tasks in reinforcement learning (RL) when the value function must be approximated and when feedback comes from humans. Specifically, it proposes a risk-sensitive RL framework that uses the Iterated Conditional Value-at-Risk (ICVaR) objective and combines linear or general function approximation with human feedback. The formulation gives a principled way to ensure safety at each decision-making step and, by integrating human feedback, bridges the gap between algorithmic decision-making and human participation, so that safety guarantees extend to human-in-the-loop systems.

### Main contributions

1. **ICVaR-RL with Function Approximation**:
   - Proposes two new sample-efficient algorithms, ICVaR-L and ICVaR-G, for linear and general function approximation respectively.
   - Proves that the regret of ICVaR-L under linear function approximation is upper bounded by \( \mathcal{O}\left(\sqrt{\alpha^{-(H + 1)} (d^2H^4 + dH^6)K}\right) \), where \( \alpha \) is the risk level, \( d \) is the dimension of the state-action features, \( H \) is the length of each episode, and \( K \) is the number of episodes.
   - Establishes a matching lower bound \( \Omega\left(\sqrt{\alpha^{-(H - 1)}d^2K}\right) \), which shows that the algorithm is nearly optimal with respect to \( d \) and \( K \).
2. **ICVaR-RL with Human Feedback**:
   - Proposes the first risk-sensitive RLHF algorithm (ICVaR-HF), which handles infinite sets of transition and reward functions and comparison-based human feedback under general function approximation.
   - Proves that the regret of ICVaR-HF is upper bounded by \( \mathcal{O}\left(\sqrt{KH^3\alpha^{-(H + 1)}(\sqrt{HD_P}+\sqrt{m^{-1}D_R})}\right) \), where \( D_P \) and \( D_R \) are the dimension parameters of the set of transition probabilities and the set of reward functions respectively, and \( m \) is a positive lower bound on the gradient of the link function.

### Technical challenges

1. **Non-linearity**:
   - The ICVaR measure is a quantile expectation under a distorted distribution, which breaks the linearity of the risk-neutral Bellman equation and makes the true value function difficult to estimate.
   - To address this, the paper develops new CVaR approximation and parameter estimation methods.
2. **Function approximation setting**:
   - Because the state space can be very large or even infinite, the traditional sample-mean technique cannot be used to compute the CVaR operator or to estimate the transition probabilities.
   - The paper proposes a new elliptical potential lemma that supports a more fine-grained analysis of how regret accumulates.
3. **Risk-sensitive setting with human feedback**:
   - The regret analysis of standard risk-neutral RL with human feedback does not carry over to the risk-sensitive setting.
   - The paper develops new discretization methods and regret decomposition techniques to handle infinite sets of reward functions and comparison-based human feedback.

### Conclusion

By introducing new algorithms and analytical tools, the paper addresses several key problems in risk-sensitive RL with function approximation and human feedback, and provides a theoretical foundation for further research in this area.
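
### Illustrative sketch: the Iterated CVaR backup

The summary above refers to the CVaR operator and to the way it breaks the linearity of the Bellman equation without writing either down. The following is a minimal sketch of the idea, not of ICVaR-L, ICVaR-G, or ICVaR-HF: it assumes a small tabular MDP whose transition probabilities are already known, and it assumes the usual Iterated CVaR recursion \( V_h(s) = \max_a \left\{ r(s,a) + \mathrm{CVaR}^{\alpha}_{s' \sim P(\cdot \mid s,a)}\left(V_{h+1}(s')\right) \right\} \), where \( \mathrm{CVaR}^{\alpha} \) averages the worst \( \alpha \)-fraction of outcomes. The function names `cvar` and `iterated_cvar_values` and the toy MDP are hypothetical and chosen only for illustration.

```python
def cvar(values, probs, alpha):
    """Lower-tail CVaR at level alpha of a discrete random variable.

    Averages the worst (lowest) alpha-fraction of probability mass.
    alpha = 1 recovers the ordinary expectation.
    """
    remaining = alpha
    total = 0.0
    # Visit outcomes from worst to best and consume alpha units of mass.
    for v, p in sorted(zip(values, probs)):
        take = min(p, remaining)
        total += take * v
        remaining -= take
        if remaining <= 0:
            break
    return total / alpha


def iterated_cvar_values(P, r, H, alpha):
    """Backward induction with a CVaR backup at every step (known model).

    P[s][a] is a list of next-state probabilities, r[s][a] a reward.
    Returns values V[h][s] and a greedy policy pi[h][s].
    """
    S, A = len(r), len(r[0])
    V = [[0.0] * S for _ in range(H + 1)]
    pi = [[0] * S for _ in range(H)]
    for h in range(H - 1, -1, -1):
        for s in range(S):
            q = [r[s][a] + cvar(V[h + 1], P[s][a], alpha) for a in range(A)]
            pi[h][s] = max(range(A), key=lambda a: q[a])
            V[h][s] = q[pi[h][s]]
    return V, pi


# Hypothetical toy MDP: state 0 is the start; states 1, 2, 3 are absorbing.
# Action 0 in state 0 is "risky" (50/50 between reward 1 and reward 0),
# action 1 is "safe" (always leads to reward 0.4).
P = [
    [[0.0, 0.5, 0.5, 0.0], [0.0, 0.0, 0.0, 1.0]],
    [[0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 1.0]],
]
r = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.4, 0.4]]

for alpha in (1.0, 0.25):
    V, pi = iterated_cvar_values(P, r, H=2, alpha=alpha)
    print(f"alpha={alpha}: V_1(start)={V[0][0]:.2f}, action={pi[0][0]}")
```

With \( \alpha = 1 \) the backup reduces to the risk-neutral expectation and the risky action is preferred (value 0.5 against 0.4). With \( \alpha = 0.25 \) only the worst quarter of outcomes counts, the risky action is valued at 0, and the safe action is chosen. The sort inside `cvar` is what makes the backup non-linear in the transition probabilities. The sketch enumerates all next states, which is not possible in the large or infinite state spaces the paper targets.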