Abstract: We study a general problem of allocating limited resources to heterogeneous customers over time under model uncertainty. Each type of customer can be serviced using different actions, each of which stochastically consumes some combination of resources and returns different rewards for the resources consumed. We consider a general model in which the resource consumption distribution associated with each customer type–action combination is not known but is consistent and can be learned over time. In addition, the sequence of customer types to arrive over time is arbitrary and completely unknown. We overcome both the challenges of model uncertainty and customer heterogeneity by judiciously synthesizing two algorithmic frameworks from the literature: inventory balancing, which “reserves” a portion of each resource for high-reward customer types that could later arrive based on competitive ratio analysis, and online learning, which “explores” the resource consumption distributions for each customer type under different actions based on regret analysis. We define an auxiliary problem, which allows for existing competitive ratio and regret bounds to be seamlessly integrated. Furthermore, we propose a new variant of upper confidence bound (UCB), dubbed lazyUCB, which conducts less exploration in a bid to focus on “exploitation” in view of the resource scarcity. Finally, we construct an information-theoretic family of counterexamples to show that our integrated framework achieves the best possible performance guarantee. We demonstrate the efficacy of our algorithms on both synthetic instances generated for the online matching with stochastic rewards problem under unknown probabilities and a publicly available hotel data set. Our framework is highly practical in that it requires no historical data (no fitted customer choice models or forecasting of customer arrival patterns) and can be used to initialize allocation strategies in fast-changing environments.
This paper was accepted by J. George Shanthikumar, Management Science Special Section on Data-Driven Prescriptive Analytics.
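
The way the abstract combines inventory balancing with online learning can be illustrated with a minimal sketch. It is not the paper's algorithm and not lazyUCB. It assumes a single customer type, one unit of one resource consumed per accepted offer, a standard UCB1-style confidence bonus, and the exponential penalty function commonly used in the inventory balancing literature. All names (`penalty`, `ucb`, `simulate`) and parameters are hypothetical.

```python
import math
import random


def penalty(fraction_remaining):
    """Exponential inventory-balancing penalty (assumed form): 0 when empty, 1 when full."""
    return (1.0 - math.exp(-fraction_remaining)) / (1.0 - math.exp(-1.0))


def ucb(successes, pulls, t):
    """Optimistic estimate of an unknown acceptance probability, capped at 1."""
    if pulls == 0:
        return 1.0  # untried actions get the most optimistic value
    bonus = math.sqrt(2.0 * math.log(t) / pulls)
    return min(1.0, successes / pulls + bonus)


def simulate(rewards, capacities, true_probs, horizon, seed=0):
    """Offer one resource per arrival; an offer is accepted with an unknown probability.

    Each arrival is scored as reward * optimistic probability * penalty,
    so scarce resources are held back while probabilities are being learned.
    Assumes strictly positive rewards and capacities.
    """
    rng = random.Random(seed)
    n = len(rewards)
    remaining = list(capacities)
    pulls = [0] * n
    successes = [0] * n
    total_reward = 0.0
    for t in range(1, horizon + 1):
        best, best_score = None, 0.0
        for i in range(n):
            if remaining[i] == 0:
                continue  # a depleted resource cannot be offered
            score = (rewards[i]
                     * ucb(successes[i], pulls[i], t)
                     * penalty(remaining[i] / capacities[i]))
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break  # every resource is depleted
        pulls[best] += 1
        if rng.random() < true_probs[best]:
            successes[best] += 1
            remaining[best] -= 1
            total_reward += rewards[best]
    return total_reward, remaining


if __name__ == "__main__":
    reward, left = simulate([1.0, 2.0], [3, 2], [0.5, 0.5], horizon=200, seed=1)
    print("total reward:", reward, "remaining inventory:", left)
```

The multiplicative score is one simple way to let the penalty discount an optimistic reward estimate. The paper's auxiliary problem and its lazyUCB variant are more refined than this illustration.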
