Abstract: Reinforcement Learning (RL) has gained substantial attention across diverse application domains and theoretical investigations. Existing literature on RL theory largely focuses on risk-neutral settings, where the decision-maker learns to maximize the expected cumulative reward. However, in practical scenarios such as portfolio management and e-commerce recommendations, decision-makers often hold heterogeneous risk preferences under outcome uncertainty, which cannot be captured well by the risk-neutral framework. These preferences can be incorporated through utility theory, yet the development of risk-sensitive RL under general utility functions remains an open question for theoretical exploration. In this paper, we consider a scenario where the decision-maker seeks to optimize a general utility function of the cumulative reward in the framework of a Markov decision process (MDP). To make the dynamic programming principle and the Bellman equation applicable, we enlarge the state space with an additional dimension that tracks the cumulative reward. We propose a discretized approximation scheme for the MDP on the enlarged state space, which is tractable and key to algorithmic design. We then propose a modified value iteration algorithm that employs an epsilon-covering over the space of cumulative rewards. When a simulator is accessible, our algorithm efficiently learns a near-optimal policy with guaranteed sample complexity. In the absence of a simulator, our algorithm, designed with an upper-confidence-bound exploration approach, identifies a near-optimal policy while ensuring a guaranteed regret bound. For both algorithms, we match the theoretical lower bounds for the risk-neutral setting.
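The enlarged-state idea can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a small tabular, finite-horizon MDP whose transitions `P` and deterministic rewards `R` (in [0, 1]) are already known, so there is no sampling and no upper-confidence-bound exploration. The function name `utility_value_iteration`, the nearest-grid-point projection, and the toy MDP are all hypothetical choices made for illustration. The cumulative reward `y` is tracked as an extra state coordinate, restricted to a grid of spacing `eps`, and backward induction starts from the terminal value `utility(y)`.

```python
import math

def utility_value_iteration(P, R, H, utility, eps):
    """Backward induction on the enlarged state (s, y), y = reward collected so far.

    P[s][a][s2] : transition probability (assumed known in this sketch)
    R[s][a]     : deterministic reward in [0, 1]
    H           : horizon (number of steps)
    utility     : function applied to the final cumulative reward
    eps         : grid spacing for the cumulative-reward coordinate
    """
    n_states, n_actions = len(P), len(P[0])
    K = math.ceil(H / eps)
    grid = [min(k * eps, H) for k in range(K + 1)]  # covers [0, H]

    def project(y):
        # Index of the nearest grid point, clamped to the grid.
        return min(K, max(0, round(y / eps)))

    # Terminal condition: V_H(s, y) = utility(y) for every state s.
    V = [[utility(grid[k]) for k in range(K + 1)] for _ in range(n_states)]
    policy = []
    for _ in range(H):  # steps H-1, H-2, ..., 0
        V_new = [[0.0] * (K + 1) for _ in range(n_states)]
        pi_h = [[0] * (K + 1) for _ in range(n_states)]
        for s in range(n_states):
            for k in range(K + 1):
                best_a, best_q = 0, -math.inf
                for a in range(n_actions):
                    k2 = project(grid[k] + R[s][a])  # updated cumulative reward
                    q = sum(P[s][a][s2] * V[s2][k2] for s2 in range(n_states))
                    if q > best_q:
                        best_a, best_q = a, q
                V_new[s][k], pi_h[s][k] = best_q, best_a
        V = V_new
        policy.append(pi_h)
    policy.reverse()  # policy[h][s][k] is the action at step h
    return V, policy, grid

# Toy MDP (hypothetical): from state 0, action 0 leads to a state paying 0.5,
# action 1 leads with equal probability to a state paying 1.0 or one paying 0.0.
P = [
    [[0, 1, 0, 0], [0, 0, 0.5, 0.5]],
    [[0, 1, 0, 0], [0, 1, 0, 0]],
    [[0, 0, 1, 0], [0, 0, 1, 0]],
    [[0, 0, 0, 1], [0, 0, 0, 1]],
]
R = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0], [0.0, 0.0]]

V_averse, pi_averse, _ = utility_value_iteration(P, R, 2, math.sqrt, 0.5)
V_seeking, pi_seeking, _ = utility_value_iteration(P, R, 2, lambda y: y * y, 0.5)
print(pi_averse[0][0][0], pi_seeking[0][0][0])
```

Both actions in the toy MDP have the same expected cumulative reward, so a risk-neutral learner is indifferent between them. With the concave utility `math.sqrt` the sketch selects the safe action at the start state, and with the convex utility `y * y` it selects the risky one, which is the kind of preference the expected-reward objective cannot express.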

Risk-sensitive Markov Decision Process and Learning under General Utility Functions

Logarithmic regret bounds for continuous-time average-reward Markov decision processes

Dynamic Regret of Online Markov Decision Processes

Risk-Averse Markov Decision Processes through a Distributional Lens

Duality in Regret Measures and Risk Measures

Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning

Provably Efficient Risk-Sensitive Reinforcement Learning: Iterated CVaR and Worst Path