Abstract: The recent rising popularity of ultra-fast delivery services on retail platforms fuels the increasing use of urban warehouses, whose proximity to customers makes fast deliveries viable. The space limit in urban warehouses poses a problem for the online retailers: the number of products (SKUs) they carry is no longer "the more, the better", yet it can still be significantly large, reaching hundreds or thousands in a product category. In this paper, we study algorithms for dynamically identifying a large number of products (i.e., SKUs) with top customer purchase probabilities on the fly, from an ocean of potential products to offer on retailers' ultra-fast delivery platforms. We distill the product selection problem into a semi-bandit model with linear generalization. There are in total $N$ different arms, each with a feature vector of dimension $d$. The player pulls $K$ arms in each period and observes the bandit feedback from each of the pulled arms. We focus on the setting where $K$ is much greater than the number of total time periods $T$ or the dimension of product features $d$. We first analyze a standard UCB algorithm and show its regret bound can be expressed as the sum of a $T$-independent part $\tilde O(K d^{3/2})$ and a $T$-dependent part $\tilde O(d\sqrt{KT})$, which we refer to as "fixed cost" and "variable cost" respectively. To reduce the fixed cost for large $K$ values, we propose a novel online learning algorithm, which iteratively shrinks the upper confidence bounds within each period, and show its fixed cost is reduced by a factor of $d$ to $\tilde O(K \sqrt{d})$. Moreover, we test the algorithms on an industrial dataset from Alibaba Group. Experimental results show that our new algorithm reduces the total regret of the standard UCB algorithm by at least 10%.
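As a rough worked comparison of the two regret terms quoted in the abstract, the inequalities below show when the fixed cost is the larger term. This sketch assumes the logarithmic factors and constants hidden by $\tilde O$ can be ignored, which the abstract does not state.

\[
K d^{3/2} \ge d\sqrt{KT} \iff \sqrt{Kd} \ge \sqrt{T} \iff Kd \ge T,
\qquad
K\sqrt{d} \ge d\sqrt{KT} \iff \sqrt{K} \ge \sqrt{dT} \iff K \ge dT.
\]

Under the standard UCB bound, the fixed cost therefore dominates whenever $Kd \ge T$, while under the reduced bound it dominates only when $K \ge dT$. For illustrative values $K = 1000$ and $d = 16$ (chosen here, not taken from the paper), the two fixed-cost terms are $1000 \cdot 16^{3/2} = 64000$ and $1000 \cdot \sqrt{16} = 4000$, a ratio of $d = 16$.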

What problem does this paper attempt to address?

This paper addresses the problem of dynamically selecting which products to stock in urban warehouses. As ultra-fast delivery services on retail platforms grow in popularity, urban warehouses have become the key to rapid delivery because of their proximity to consumers. However, their limited space prevents online retailers from offering as many products (SKUs) as traditional e-commerce platforms do. The retailer must therefore dynamically identify, from a very large pool of potential products, a large subset with the highest customer purchase probabilities.

The paper distills this product selection problem into a semi-bandit model with linear generalization. The model has $N$ arms (each product is an arm), and each arm has a $d$-dimensional feature vector. The player selects $K$ arms in each period and observes the bandit feedback from each selected arm. The study focuses on the case where $K$ is much larger than the total number of periods $T$ or the feature dimension $d$. The paper first shows that the regret bound of a standard UCB algorithm splits into a $T$-independent "fixed cost" of $\tilde{O}(Kd^{3/2})$ and a $T$-dependent "variable cost" of $\tilde{O}(d\sqrt{KT})$. To reduce the fixed cost, it proposes a new online learning algorithm, ConsUCB (Conservative Upper Confidence Bound), which iteratively shrinks the upper confidence bounds within each period. Experimental results show that the new algorithm reduces the total regret of the standard UCB algorithm by at least 10%.

### Specific Problem Description

1. **Background**: With the rise of ultra-fast delivery services, urban warehouses have become the key to rapid delivery because of their location. Because their space is limited, retailers cannot add products without limit and must select the products most likely to be purchased.
2. **Objective**: Dynamically select the optimal product set from a large number of potential products, maximizing the probability of meeting customer needs and thereby improving sales performance.
3. **Challenges**:
   - The number of products is large: a single category may contain hundreds to thousands of products.
   - Sales data are limited, especially for low-demand products.
   - The selection must be adjusted several times within a limited horizon (such as a quarter) to adapt to market changes.
4. **Method**: The paper formulates a semi-bandit model with linear generalization and analyzes the regret bound of the standard UCB algorithm. Building on this analysis, it proposes a new algorithm, ConsUCB, which reduces the fixed cost by iteratively shrinking the upper confidence bounds within each period.
5. **Contributions**:
   - A new regret bound analysis of the standard UCB algorithm in this model setting.
   - A new algorithm, ConsUCB, which reduces the fixed cost by a factor of $d$.
   - An empirical evaluation of the new algorithm on data from Alibaba Group.

### Mathematical Model

- **Model Setup**:
  - $N$: The total number of candidate products.
  - $T$: The total number of periods.
  - $K$: The number of products selected in each period.
  - $d$: The dimension of each product's feature vector.
  - $\theta^*$: The unknown parameter vector that links a product's sales probability linearly to its feature vector.
- **Objective Function**: Maximize the expected reward in each period, that is, the sum of the sales probabilities of the selected products.
- **Regret Definition**:
  \[ R(T) = \sum_{t=1}^{T} \sum_{i \in S^*} \mu(i) - \sum_{t=1}^{T} \sum_{i \in S_t} \mu(i) \]
  where $\mu(i)$ is the expected reward (sales probability) of product $i$, $S^*$ is the optimal product set, and $S_t$ is the product set selected in period $t$.

### New Algorithm ConsUCB

- **Core Idea**: Reduce the fixed cost by iteratively shrinking the upper confidence bounds within each period.
- **Fixed Cost**: The fixed cost of the standard UCB algorithm is $\tilde{O}(Kd^{3/2})$, while the fixed cost of the ConsUCB algorithm is reduced to $\tilde{O}(K\sqrt{d})$.
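The difference between ranking all arms once and shrinking the confidence bounds within a period can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the paper's ConsUCB: the names `theta_hat`, `V_inv`, `alpha`, `select_standard`, and `select_shrinking` are hypothetical, the UCB index is the usual linear-bandit form (estimate plus `alpha` times a confidence width), and the shrinking rule shown (fold each chosen arm's feature vector into the design matrix before picking the next arm) is only one possible reading of "iteratively shrinks the upper confidence bounds within each period".

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def mat_vec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, x) for row in M]

def ucb_score(x, theta_hat, V_inv, alpha):
    """Linear UCB index: estimated reward plus alpha times the width sqrt(x^T V^{-1} x)."""
    width = math.sqrt(max(dot(x, mat_vec(V_inv, x)), 0.0))
    return dot(x, theta_hat) + alpha * width

def rank_one_update(V_inv, x):
    """Sherman-Morrison: return (V + x x^T)^{-1} given V^{-1} for symmetric V."""
    Vx = mat_vec(V_inv, x)
    denom = 1.0 + dot(x, Vx)
    d = len(x)
    return [[V_inv[i][j] - Vx[i] * Vx[j] / denom for j in range(d)]
            for i in range(d)]

def select_standard(features, theta_hat, V_inv, alpha, K):
    """Standard UCB: score every arm once and take the K highest indices."""
    scores = [ucb_score(x, theta_hat, V_inv, alpha) for x in features]
    order = sorted(range(len(features)), key=lambda i: -scores[i])
    return order[:K]

def select_shrinking(features, theta_hat, V_inv, alpha, K):
    """Hypothetical shrinking variant (an assumption, not the paper's algorithm):
    pick arms one at a time and, after each pick, update V_inv with the chosen
    feature vector so the widths of similar remaining arms shrink."""
    remaining = list(range(len(features)))
    chosen = []
    for _ in range(min(K, len(remaining))):
        best = max(remaining,
                   key=lambda i: ucb_score(features[i], theta_hat, V_inv, alpha))
        chosen.append(best)
        remaining.remove(best)
        # The width depends only on features, so it can shrink before rewards arrive.
        V_inv = rank_one_update(V_inv, features[best])
    return chosen

# Toy instance: arms 0 and 1 share a feature vector, arm 2 points elsewhere.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(select_standard(features, [0.0, 0.0], identity, 1.0, 2))   # [0, 1]
print(select_shrinking(features, [0.0, 0.0], identity, 1.0, 2))  # [0, 2]
```

In the toy instance, the one-shot ranking spends both slots on the duplicated direction, while the shrinking variant moves to the unexplored direction after its first pick because the width of the duplicate arm has dropped from 1 to about 0.71.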

Shrinking the Upper Confidence Bound: A Dynamic Product Selection Problem for Urban Warehouses
