Markov decision processes with observation costs: framework and computation with a penalty scheme

Christoph Reisinger, Jonathan Tam
2023-12-06
Abstract: We consider Markov decision processes where the state of the chain is only given at chosen observation times and at a cost. Optimal strategies involve the optimisation of observation times as well as the subsequent action values. We consider the finite horizon and discounted infinite horizon problems, as well as an extension with parameter uncertainty. By including the time elapsed from observations as part of the augmented Markov system, the value function satisfies a system of quasi-variational inequalities (QVIs). Such a class of QVIs can be seen as an extension to the interconnected obstacle problem. We prove a comparison principle for this class of QVIs, which implies uniqueness of solutions to our proposed problem. Penalty methods are then utilised to obtain arbitrarily accurate solutions. Finally, we perform numerical experiments on three applications which illustrate our framework.
Optimization and Control, Numerical Analysis
What problem does this paper attempt to address?
The paper studies optimization problems in Markov Decision Processes (MDPs) with observation costs. The state of the MDP can only be obtained at chosen observation times, and each observation incurs a cost. An optimal strategy must therefore select both the observation times and the subsequent action values. The key contributions of the paper can be summarized as follows:

1. **Construction of the Observation Cost Model (OCM)**:
   - Introduces observation costs on top of standard MDPs, so that observing the system state comes at a cost.
   - The OCM assumes that the action remains unchanged between two observations.
   - The paper formulates the OCM as a Partially Observable Markov Decision Process (POMDP), in which the time elapsed since the last observation is part of the augmented Markov system.
2. **Mathematical Modeling of the Optimization Problem**:
   - Defines the finite-horizon and discounted infinite-horizon problems.
   - For each problem, derives optimality equations through dynamic programming, expressed as a system of Quasi-Variational Inequalities (QVIs).
   - Establishes a comparison principle for this class of QVIs, which implies uniqueness of solutions to the proposed problem.
3. **Numerical Methods and Experimental Validation**:
   - Proposes a penalty scheme to compute arbitrarily accurate approximations of the solutions to these QVIs.
   - Illustrates the framework with numerical experiments on three applications.

In summary, the paper provides a theoretical framework for MDPs with observation costs, together with a numerical method for solving them. Application areas mentioned include maintenance, portfolio optimization, sensor detection, and reinforcement learning.
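
To make the structure of such a model concrete, here is a minimal Python sketch of a discounted infinite-horizon problem with observation costs. It is an illustration under stated assumptions, not the paper's formulation or code: the augmented state (held action, last observed state, steps since the observation), the truncation `N` that forces an observation, the function name `solve_ocm`, and the toy data `P_demo` and `r_demo` are all hypothetical. With `rho=None` the sketch runs plain value iteration on the obstacle-type equation; with a finite `rho` it iterates on a penalised relaxation `v = cont + rho * max(obs - v, 0)`. That relaxation is meant to mirror the idea of a penalty scheme and is solved here by simple fixed-point iteration, which may differ from the solver used in the paper.

```python
import numpy as np


def solve_ocm(P, r, gamma, c_obs, N=30, rho=None, tol=1e-10, max_iter=10_000):
    """Toy discounted MDP with observation costs (illustrative assumptions only).

    P: (A, S, S) transition matrices, r: (A, S) one-step rewards.
    Augmented state: (held action a, last observed state x, n steps since observing).
    Returns w[y], the value just after observing y, and v[a, x, n-1].
    """
    A, S, _ = P.shape
    # beliefs[a, x, n-1, :] = law of the current state n steps after observing x
    beliefs = np.empty((A, S, N, S))
    for a in range(A):
        b = P[a].copy()
        for n in range(N):
            beliefs[a, :, n, :] = b
            b = b @ P[a]
    # Expected one-step reward when acting on the belief without observing.
    exp_r = np.einsum("axns,as->axn", beliefs, r)

    v = np.zeros((A, S, N))
    for _ in range(max_iter):
        # Value right after an observation: pick the best action to hold next.
        w = np.max(r + gamma * v[:, :, 0], axis=0)
        # Observe: pay the cost, learn the current state, then re-optimise.
        obs = beliefs @ w - c_obs
        # Continue: keep the held action and let one more step elapse unobserved.
        cont = exp_r[:, :, :-1] + gamma * v[:, :, 1:]
        if rho is None:
            inner = np.maximum(cont, obs[:, :, :-1])  # exact obstacle form
        else:
            # Pointwise solution of v = cont + rho * max(obs - v, 0),
            # with cont and obs frozen at the current iterate.
            inner = np.maximum(cont, (cont + rho * obs[:, :, :-1]) / (1.0 + rho))
        # Truncation (an assumption): an observation is forced after N steps.
        v_new = np.concatenate([inner, obs[:, :, -1:]], axis=2)
        done = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if done:
            break
    w = np.max(r + gamma * v[:, :, 0], axis=0)
    return w, v


# Hypothetical two-action, three-state example.
P_demo = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
    [[0.5, 0.5, 0.0], [0.3, 0.4, 0.3], [0.0, 0.5, 0.5]],
])
r_demo = np.array([[1.0, 0.0, -1.0], [0.0, 0.5, 0.2]])

w_exact, _ = solve_ocm(P_demo, r_demo, gamma=0.9, c_obs=0.2)
for rho in (10.0, 1e2, 1e4):
    w_rho, _ = solve_ocm(P_demo, r_demo, gamma=0.9, c_obs=0.2, rho=rho)
    # Gap between the penalised and the exact value at observed states.
    print(rho, np.max(np.abs(w_rho - w_exact)))
```

In this toy setting the penalised values lie below the exact ones and the printed gap should not grow as `rho` increases. Setting `c_obs=0` makes observing at every step optimal, so `w` then coincides with the value function of the standard fully observed MDP, which gives a simple sanity check on the sketch.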