Logarithmic Regret Bounds for Continuous-Time Average-Reward Markov Decision Processes

Xuefeng Gao, Xun Yu Zhou
DOI: https://doi.org/10.1137/23m1584101
IF: 2.2
2024-09-11
SIAM Journal on Control and Optimization
Abstract: SIAM Journal on Control and Optimization, Volume 62, Issue 5, Pages 2529-2556, October 2024. We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
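To make the setting concrete, the following is a minimal Python sketch of a continuous-time MDP with exponential holding times, together with a plain sample-mean estimate of the mean holding time for each state-action pair. It is an illustration only and is not the algorithm from the paper: the function names, the two-state example, and the fixed policy are all assumptions introduced here, and the paper's confidence-bound construction is not reproduced.

```python
import random


def simulate_ctmdp(rates, transitions, policy, horizon, start_state=0, seed=0):
    """Simulate a continuous-time MDP under a fixed policy (hypothetical helper).

    rates[s][a]       -- rate of the exponential holding time in state s under action a
    transitions[s][a] -- list of next-state probabilities
    policy[s]         -- action taken in state s
    Returns a dict mapping (state, action) to the list of completed holding times.
    """
    rng = random.Random(seed)
    t, s = 0.0, start_state
    holding = {}
    while True:
        a = policy[s]
        tau = rng.expovariate(rates[s][a])  # holding time ~ Exp(rate)
        if t + tau > horizon:
            break  # the last, incomplete holding time is discarded
        holding.setdefault((s, a), []).append(tau)
        t += tau
        probs = transitions[s][a]
        s = rng.choices(range(len(probs)), weights=probs)[0]
    return holding


def mean_holding_estimates(holding):
    """Sample-mean estimate of the mean holding time for each (state, action)."""
    return {key: sum(times) / len(times) for key, times in holding.items()}


# Illustrative two-state example with one action per state (assumed values).
rates = [[2.0], [0.5]]                    # true mean holding times: 0.5 and 2.0
transitions = [[[0.0, 1.0]], [[1.0, 0.0]]]
policy = [0, 0]

observed = simulate_ctmdp(rates, transitions, policy, horizon=5000.0)
estimates = mean_holding_estimates(observed)
```

In this sketch the learner sees only the holding times and transitions, not `rates` or `transitions` themselves, which mirrors the unknown-parameter setting described in the abstract. A sample mean is the simplest possible estimator and is used here purely for illustration.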
mathematics, applied; automation & control systems