Abstract: We consider the periodic-review dynamic pricing and inventory control problem with fixed ordering cost. Demand is random and price dependent, and unsatisfied demand is backlogged. With complete demand information, the celebrated (s,S,p) policy is proved to be optimal, where s and S are the reorder point and order-up-to level of the ordering strategy, and p, a function of the on-hand inventory level, characterizes the pricing strategy. In this paper, we consider incomplete demand information and develop online learning algorithms whose average profit approaches that of the optimal (s,S,p) policy with a tight O(T^(1/2)) regret rate. A number of salient features differentiate our work from existing online learning research in the OM literature. First, computing the optimal (s,S,p) policy requires solving a dynamic program (DP) over multiple periods involving unknown quantities, unlike the majority of learning problems in operations management, which only require solving single-period optimization problems. It is hence challenging to establish stability results through DP recursions, which we accomplish by proving uniform convergence of the profit-to-go function. The need to analyze action-dependent state transitions over multiple periods resembles reinforcement learning and is considerably more difficult than the setting of existing bandit learning algorithms. Second, the pricing function p is infinite-dimensional, and approximating it is much more challenging than estimating a finite number of parameters, as in existing research. The demand-price relationship is estimated based on an upper confidence bound, but the confidence interval cannot be explicitly calculated due to the complexity of the DP recursion. Finally, due to the multi-period nature of (s,S,p) policies, the actual distribution of the randomness in demand, which is unknown to the learner a priori, plays an important role in determining the optimal pricing strategy p. The demand randomness is approximated by an empirical distribution constructed using dependent samples, and a novel Wasserstein metric based argument is employed to prove convergence of the empirical distribution.
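
To make the structure of an (s,S,p) policy concrete, the following is a minimal Python sketch of a single review period. It is an illustration under stated assumptions, not the paper's algorithm: the function name `ssp_step`, the additive demand model, the cost parameters `K`, `c`, `h`, `b`, the strict reorder rule (order when inventory is below s), and the choice to evaluate the price at the post-order inventory level are all hypothetical choices made for this example.

```python
def ssp_step(x, s, S, price_fn, demand_fn, noise, K=10.0, c=1.0, h=0.5, b=2.0):
    """Simulate one period of an (s, S, p) policy with backlogging.

    All names and cost parameters are illustrative assumptions:
      x         starting inventory (negative means backlogged demand)
      s, S      reorder point and order-up-to level
      price_fn  maps an inventory level to a price (the "p" in (s, S, p))
      demand_fn maps a price to mean demand (assumed known here)
      noise     realized additive demand shock for this period
      K, c      fixed and per-unit ordering cost
      h, b      per-unit holding and backlog cost
    """
    # Ordering rule: if inventory is below s, raise it to S; otherwise do nothing.
    y = S if x < s else x
    order_qty = y - x

    # Pricing rule: the price depends on the inventory level (post-order, by assumption).
    price = price_fn(y)

    # Additive, price-dependent random demand (an assumed demand model).
    demand = demand_fn(price) + noise

    # Unsatisfied demand is backlogged, so next inventory may be negative.
    next_x = y - demand

    ordering_cost = K + c * order_qty if order_qty > 0 else 0.0
    inventory_cost = h * max(next_x, 0.0) + b * max(-next_x, 0.0)
    profit = price * demand - ordering_cost - inventory_cost
    return next_x, order_qty, price, profit


# Example call with hypothetical linear price and demand curves.
next_x, order_qty, price, profit = ssp_step(
    x=2.0, s=5.0, S=20.0,
    price_fn=lambda y: 10.0 - 0.1 * y,
    demand_fn=lambda p: 30.0 - 2.0 * p,
    noise=1.0,
)
```

In the learning setting described in the abstract, neither the demand curve nor the noise distribution is known, so both `demand_fn` and the distribution of `noise` would have to be estimated from observed data rather than supplied as inputs.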

Analytical Solution to A Discrete-Time Model for Dynamic Learning and Decision-Making

POMDPs in Continuous Time and Discrete Spaces

Model-free Adaptive Dynamic Programming for Optimal Control of Discrete-time Affine Nonlinear System

Numerical method to solve impulse control problems for partially observed piecewise deterministic Markov processes

Dynamic Teaching in Sequential Decision Making Environments

Control Theory Meets POMDPs: A Hybrid Systems Approach

Dynamic Pricing and Inventory Control with Fixed Ordering Cost and Incomplete Demand Information

Sample-Efficient Learning of POMDPs with Multiple Observations In Hindsight

Bridging the Gap between Partially Observable Stochastic Games and Sparse POMDP Methods

Overcoming Delayed Feedback Via Overlook Decision Making

A Scalable Model-Free Recurrent Neural Network Framework for Solving POMDPs

Markov Decision Processes with Time-Varying Geometric Discounting

PODDP: Partially Observable Differential Dynamic Programming for Latent Belief Space Planning

Towards Analysis Of Semi-Markov Decision Processes

Dynamic Programming for Structured Continuous Markov Decision Problems

Recursively-Constrained Partially Observable Markov Decision Processes

Act as You Learn: Adaptive Decision-Making in Non-Stationary Markov Decision Processes

Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees

Economic Model Predictive Control as a Solution to Markov Decision Processes

GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond

Prospective Side Information for Latent MDPs