Abstract: This paper studies risk minimization problems in Markov decision processes with countable state space and reward set. The objective is to find a policy that minimizes the probability (risk) that the total discounted reward does not exceed a specified value (target). In this kind of model, the decision made by the decision maker depends not only on the system's state but also on the target value. By introducing the decision maker's state, we formulate a framework for risk minimization models. The policies discussed depend on target values, and the rewards may be arbitrary real numbers. For the finite horizon model, the main results are: (i) the optimal value functions are distribution functions of the target, (ii) there exists an optimal deterministic Markov policy, and (iii) a policy is optimal if and only if it takes an optimal action at each realizable state. In addition, we obtain a sufficient condition and a necessary condition for the existence of a finite horizon optimal policy independent of targets, and we give an algorithm for computing finite horizon optimal policies and optimal value functions. For the infinite horizon model, we establish the optimality equation and obtain a structural property of optimal policies. We prove that the optimal value function is a distribution function of the target, and we present a new approximation formula that generalizes the one for the nonnegative rewards case. An example exposing errors in the previous literature shows that the existence of an optimal policy has not actually been proved. We give a necessary and sufficient condition for the existence of an infinite horizon optimal policy independent of targets, and we point out that whether an optimal policy exists in the general case remains an open problem.
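
To make the target-dependent setting concrete, the following is a minimal sketch of a finite horizon backward recursion for the risk, under assumptions that are not taken from the paper. It assumes the commonly used form F_n(s, x) = min_a sum p(r, s' | s, a) F_{n-1}(s', (x - r) / beta) with F_0(s, x) = 1 if x >= 0 and 0 otherwise. The toy model `TRANSITIONS`, the discount factor `BETA`, and the function names `risk` and `best_action` are hypothetical and chosen only for illustration; this is not the algorithm given in the paper.

```python
from functools import lru_cache

# Hypothetical toy model (an assumption for illustration):
# TRANSITIONS[state][action] = list of (probability, reward, next_state).
TRANSITIONS = {
    0: {
        "safe": [(1.0, 1.0, 0)],
        "risky": [(0.5, 0.0, 0), (0.5, 3.0, 0)],
    },
}
BETA = 0.5  # assumed discount factor


def action_risk(n, state, action, target):
    """Risk of taking `action` now and acting optimally for the remaining n - 1 stages."""
    return sum(
        p * risk(n - 1, nxt, (target - r) / BETA)
        for p, r, nxt in TRANSITIONS[state][action]
    )


@lru_cache(maxsize=None)
def risk(n, state, target):
    """Minimal probability that the n-stage discounted reward is <= target."""
    if n == 0:
        # With no stages left the total reward is 0, so the risk is 1 iff 0 <= target.
        return 1.0 if target >= 0 else 0.0
    return min(action_risk(n, state, a, target) for a in TRANSITIONS[state])


def best_action(n, state, target):
    """An action attaining the minimum risk; ties go to the first action listed."""
    return min(TRANSITIONS[state], key=lambda a: action_risk(n, state, a, target))


if __name__ == "__main__":
    for x in (0.5, 2.0):
        print(x, best_action(1, 0, x), risk(1, 0, x))
```

In this toy model the minimizing action changes with the target: a low target favors the action with a guaranteed reward, while a higher target favors the action with a chance of a large reward. This is the kind of dependence on the target value that the abstract describes.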

Risk and optimal policies in bandit experiments

Approximate optimality and the risk/reward tradeoff in a class of bandit problems

A Survey of Risk-Aware Multi-Armed Bandits

Asymptotically Optimal Pure Exploration for Infinite-Armed Bandits

Weak Signal Asymptotics for Sequentially Randomized Experiments

Minimax Off-Policy Evaluation for Multi-Armed Bandits

Zero-Inflated Bandits

Achieving Exponential Asymptotic Optimality in Average-Reward Restless Bandits without Global Attractor Assumption

Diminishing Exploration: A Minimalist Approach to Piecewise Stationary Multi-Armed Bandits

Non-stationary Bandits with Habituation and Recovery Dynamics and Knapsack Constraints

Minimizing Risk Models in Markov Decision Processes with Policies Depending on Target Values

Optimal Algorithms for Lipschitz Bandits with Heavy-tailed Rewards

Optimal Data Driven Resource Allocation under Multi-Armed Bandit Observations

Risk-Averse Stochastic Convex Bandit

LP-based policies for restless bandits: necessary and sufficient conditions for (exponentially fast) asymptotic optimality

Off-Policy Risk Assessment in Contextual Bandits

A central limit theorem, loss aversion and multi-armed bandits

Understanding the stochastic dynamics of sequential decision-making processes: A path-integral analysis of multi-armed bandits

Overcoming Free-Riding in Bandit Games

Bayesian Incentive-Compatible Bandit Exploration

An Optimal-Control Approach to Infinite-Horizon Restless Bandits: Achieving Asymptotic Optimality with Minimal Assumptions