Online Resource Allocation in Episodic Markov Decision Processes

Duksang Lee, William Overman, Dabeen Lee
2023-10-19
Abstract: This paper studies a long-term resource allocation problem over multiple periods where each period requires a multi-stage decision-making process. We formulate the problem as an online allocation problem in an episodic finite-horizon constrained Markov decision process with an unknown non-stationary transition function and stochastic non-stationary reward and resource consumption functions. We propose the observe-then-decide regime and improve the existing decide-then-observe regime; the two settings differ in how the observations and feedback about the reward and resource consumption functions are given to the decision-maker. We develop an online dual mirror descent algorithm that achieves near-optimal regret bounds for both settings. For the observe-then-decide regime, we prove that the expected regret against the dynamic clairvoyant optimal policy is bounded by $\tilde O(\rho^{-1}{H^{3/2}}S\sqrt{AT})$, where $\rho\in(0,1)$ is the budget parameter, $H$ is the length of the horizon, $S$ and $A$ are the numbers of states and actions, and $T$ is the number of episodes. For the decide-then-observe regime, we show that the regret against the static optimal policy that has access to the mean reward and mean resource consumption functions is bounded by $\tilde O(\rho^{-1}{H^{3/2}}S\sqrt{AT})$ with high probability. We test the numerical efficiency of our method on a variant of the resource-constrained inventory management problem.
Data Structures and Algorithms, Machine Learning, Optimization and Control
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper investigates a long-term resource allocation problem over multiple periods, where each period requires a multi-stage decision-making process. Specifically:

1. **Problem Background**:
   - The paper considers an online resource allocation problem in which a sequence of multi-stage decisions must be made in each period (episode) under a long-term resource budget.
   - The transition function is unknown and non-stationary, and the reward and resource consumption functions are both stochastic and non-stationary.

2. **Problem Modeling**:
   - The problem is modeled as an online allocation problem in an episodic finite-horizon Constrained Markov Decision Process (CMDP).
   - Two settings are studied: the newly proposed observe-then-decide regime and the existing decide-then-observe regime, which the paper improves. They differ in how observations and feedback about the reward and resource consumption functions are given to the decision-maker.

3. **Algorithms and Results**:
   - An online dual mirror descent algorithm is developed that achieves near-optimal regret bounds in both settings.
   - For the observe-then-decide setting, the expected regret against the dynamic clairvoyant optimal policy is bounded by \(\tilde{O}(\rho^{-1}H^{3/2}S\sqrt{AT})\), where \(\rho\in(0,1)\) is the budget parameter, \(H\) is the horizon length, \(S\) and \(A\) are the numbers of states and actions, and \(T\) is the number of episodes.
   - For the decide-then-observe setting, the regret against the static optimal policy that has access to the mean reward and mean resource consumption functions is bounded by \(\tilde{O}(\rho^{-1}H^{3/2}S\sqrt{AT})\) with high probability.

4. **Numerical Experiments**:
   - The numerical performance of the method is tested on a variant of the resource-constrained inventory management problem.

In summary, the paper addresses long-term resource allocation in multi-stage decision processes by formulating it as an online allocation problem in an episodic CMDP and designing an online dual mirror descent algorithm with regret guarantees for both feedback regimes.
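
As a rough illustration of the dual-descent idea behind this kind of algorithm, the sketch below handles a much simpler setting than the paper's: a single resource, one decision per round (no multi-stage MDP and no unknown transitions), and a Euclidean mirror map, so the mirror descent step reduces to projected gradient descent on a nonnegative price. The function name `dual_descent_allocation`, the step size `eta`, and the toy data are assumptions made for illustration; none of them come from the paper.

```python
def dual_descent_allocation(rewards, costs, rho, eta):
    """Simplified single-resource online allocation driven by a dual price.

    rewards[t][a], costs[t][a]: reward and resource consumption of action a
    in round t, revealed before the decision (an observe-then-decide style
    assumption). rho is the per-round budget, so the total budget is rho * T.
    eta is an assumed step size for the dual update.
    """
    T = len(rewards)
    budget = rho * T
    lam = 0.0           # dual variable: current price of the resource
    total_reward = 0.0
    spent = 0.0
    for r_t, c_t in zip(rewards, costs):
        # Primal step: pick the action with the best price-adjusted reward.
        a = max(range(len(r_t)), key=lambda i: r_t[i] - lam * c_t[i])
        # Execute the action only if the remaining budget allows it.
        if spent + c_t[a] <= budget:
            total_reward += r_t[a]
            spent += c_t[a]
        # Dual step: gradient step on the price, projected onto lam >= 0.
        # The price rises when the proposed consumption exceeds rho.
        lam = max(0.0, lam - eta * (rho - c_t[a]))
    return total_reward, spent, lam


# Toy instance: action 0 does nothing, action 1 earns 1 and consumes 1.
toy_rewards = [[0.0, 1.0] for _ in range(10)]
toy_costs = [[0.0, 1.0] for _ in range(10)]
total, spent, lam = dual_descent_allocation(toy_rewards, toy_costs, rho=0.5, eta=0.5)
print(total, spent, lam)
```

In this toy run the price grows while the costly action is chosen and falls back when the free action is chosen, which spreads consumption across rounds instead of exhausting the budget greedily. The paper's algorithm additionally has to estimate transitions and plan over a horizon of length \(H\), which this sketch leaves out.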