Online Resource Allocation in Episodic Markov Decision Processes

Duksang Lee, William Overman, Dabeen Lee

2023-10-19

Abstract: This paper studies a long-term resource allocation problem over multiple periods where each period requires a multi-stage decision-making process. We formulate the problem as an online allocation problem in an episodic finite-horizon constrained Markov decision process with an unknown non-stationary transition function and stochastic non-stationary reward and resource consumption functions. We propose the observe-then-decide regime and improve the existing decide-then-observe regime; the two settings differ in how the observations and feedback about the reward and resource consumption functions are given to the decision-maker. We develop an online dual mirror descent algorithm that achieves near-optimal regret bounds for both settings. For the observe-then-decide regime, we prove that the expected regret against the dynamic clairvoyant optimal policy is bounded by $\tilde O(\rho^{-1}H^{3/2}S\sqrt{AT})$, where $\rho\in(0,1)$ is the budget parameter, $H$ is the length of the horizon, $S$ and $A$ are the numbers of states and actions, and $T$ is the number of episodes. For the decide-then-observe regime, we show that the regret against the static optimal policy that has access to the mean reward and mean resource consumption functions is bounded by $\tilde O(\rho^{-1}H^{3/2}S\sqrt{AT})$ with high probability. We test the numerical efficiency of our method on a variant of the resource-constrained inventory management problem.

Data Structures and Algorithms, Machine Learning, Optimization and Control

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper studies a long-term resource allocation problem over multiple periods, where each period requires a multi-stage decision-making process. Specifically:

1. **Problem Background**:
   - The paper considers an online resource allocation problem in which a sequence of multi-stage decisions must be made in each period (episode).
   - The transition function is unknown and non-stationary, and the reward and resource consumption functions are stochastic and non-stationary.
2. **Problem Modeling**:
   - The problem is modeled as an online allocation problem in an episodic finite-horizon Constrained Markov Decision Process (CMDP).
   - Two settings are considered: the newly proposed observe-then-decide regime and the existing decide-then-observe regime. They differ in how observations and feedback about the reward and resource consumption functions are given to the decision-maker.
3. **Algorithms and Results**:
   - An online dual mirror descent algorithm is developed that achieves near-optimal regret bounds in both settings.
   - For the observe-then-decide setting, the expected regret against the dynamic clairvoyant optimal policy is bounded by $\tilde{O}(\rho^{-1}H^{3/2}S\sqrt{AT})$.
   - For the decide-then-observe setting, the regret against the static optimal policy, which has access to the mean reward and mean resource consumption functions, is bounded by $\tilde{O}(\rho^{-1}H^{3/2}S\sqrt{AT})$ with high probability.
   - Here $\rho\in(0,1)$ is the budget parameter, $H$ is the horizon length, $S$ and $A$ are the numbers of states and actions, and $T$ is the number of episodes.
4. **Numerical Experiments**:
   - The numerical performance of the method is tested on a variant of the resource-constrained inventory management problem.

In summary, the paper addresses long-term resource allocation with multi-stage decisions in each period by formulating it as online allocation in an episodic CMDP and designing an online dual mirror descent algorithm with near-optimal regret in both feedback settings.
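To give a feel for the dual-descent idea behind this family of methods, the sketch below runs a projected dual subgradient update (dual mirror descent with a Euclidean mirror map) on a much simpler problem than the one in the paper: a single state, a single resource, and rewards and costs that are revealed before each decision. It is not the paper's algorithm, which additionally has to learn the unknown transition function of the CMDP. The function name `dual_descent_allocation`, the null action at index 0, and the step size `eta` are all assumptions made for illustration.

```python
import random


def dual_descent_allocation(rewards, costs, rho, eta):
    """Single-resource online allocation via projected dual subgradient descent.

    rewards[t][a], costs[t][a]: reward and resource use of action a in round t.
    Action 0 is assumed to be a null action with zero reward and zero cost.
    rho: per-round budget, so the total budget is rho * T.
    eta: dual step size (a hypothetical tuning parameter).
    """
    T = len(rewards)
    budget = rho * T
    lam = 0.0  # dual variable: the current "price" of the resource
    total_reward = 0.0
    spent = 0.0
    for r_t, c_t in zip(rewards, costs):
        # Primal step: pick the action with the best price-adjusted reward.
        a = max(range(len(r_t)), key=lambda i: r_t[i] - lam * c_t[i])
        # Fall back to the null action if the choice would exceed the budget.
        if spent + c_t[a] > budget:
            a = 0
        total_reward += r_t[a]
        spent += c_t[a]
        # Dual step: move against the subgradient (rho - cost), project onto lam >= 0.
        lam = max(0.0, lam - eta * (rho - c_t[a]))
    return total_reward, spent, lam


# Illustrative run on synthetic data (values are arbitrary, not from the paper).
rng = random.Random(0)
T = 1000
rewards = [[0.0, rng.random(), rng.random()] for _ in range(T)]
costs = [[0.0, rng.random(), rng.random()] for _ in range(T)]
total, spent, lam = dual_descent_allocation(rewards, costs, rho=0.25, eta=T ** -0.5)
print(total, spent, lam)
```

The dual variable rises whenever a round consumes more than the per-round budget `rho` and falls otherwise, which spreads consumption across the whole horizon instead of exhausting the budget early. The step size of order $1/\sqrt{T}$ used in the run is a common choice for this kind of update, not a value taken from the paper.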

Online Reinforcement Learning in Markov Decision Process Using Linear Programming

Dynamic Regret of Online Markov Decision Processes

Contextual Decision-Making with Knapsacks Beyond the Worst Case

Online Stochastic Allocation of Reusable Resources

Online Contextual Decision-Making with a Smart Predict-then-Optimize Method

Sequential Fair Resource Allocation under a Markov Decision Process Framework

Stateful Posted Pricing with Vanishing Regret via Dynamic Deterministic Markov Decision Processes

Two-stage Online Reusable Resource Allocation: Reservation, Overbooking and Confirmation Call

Dynamic Resource Allocation: The Geometry and Robustness of Constant Regret

Online Resource Allocation in Markov Chains

Dynamic Resource Allocation: Algorithmic Design Principles and Spectrum of Achievable Performances

Optimal Regularized Online Allocation by Adaptive Re-Solving

The Best of Many Worlds: Dual Mirror Descent for Online Allocation Problems

Adaptive resource allocation for media services based on semi-Markov decision process

Modeling and Optimizing Resource Allocation Decisions through Multi-model Markov Decision Processes with Capacity Constraints

Decoupling Learning and Decision-Making: Breaking the $\mathcal{O}(\sqrt{T})$ Barrier in Online Resource Allocation with First-Order Methods

Target-Following Online Resource Allocation Using Proxy Assignments

Online Resource Allocation with Non-Stationary Customers

Long-Term Resource Allocation Fairness in Average Markov Decision Process (AMDP) Environment

Online Resource Allocation: Bandits feedback and Advice on Time-varying Demands