DISCO: An End-to-End Bandit Framework for Personalised Discount Allocation

Jason Shuo Zhang, Benjamin Howson, Panayiota Savva, Eleanor Loh

2024-06-13

Abstract: Personalised discount codes provide a powerful mechanism for managing customer relationships and operational spend in e-commerce. Bandits are well suited for this product area, given the partial-information nature of the problem, as well as the need for adaptation to the changing business environment. Here, we introduce DISCO, an end-to-end contextual bandit framework for personalised discount code allocation at ASOS. DISCO adapts the traditional Thompson Sampling algorithm by integrating it within an integer program, thereby allowing for operational cost control. Because bandit learning is often worse with high-dimensional actions, we focused on building low-dimensional action and context representations that were nonetheless capable of good accuracy. Additionally, we sought to build a model that preserved the relationship between price and sales, in which customers increase their purchasing in response to lower prices ("negative price elasticity"). These aims were achieved by using radial basis functions to represent the continuous (i.e. infinite-armed) action space, in combination with context embeddings extracted from a neural network. These feature representations were used within a Thompson Sampling framework to facilitate exploration, and further integrated with an integer program to allocate discount codes across ASOS's customer base. These modelling decisions result in a reward model that (a) enables pooled learning across similar actions, (b) is highly accurate, including in extrapolation, and (c) preserves the expected negative price elasticity. Through offline analysis, we show that DISCO is able to effectively enact exploration and improves its performance over time, despite the global constraint. Finally, we subjected DISCO to a rigorous online A/B test, and found that it achieves a significant improvement of >1% in average basket value, relative to the legacy systems.
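The idea of representing the continuous discount (action) space with radial basis functions can be illustrated with a minimal sketch. It assumes Gaussian basis functions, discount levels expressed as fractions (0.10 = 10% off), and a hypothetical grid of centres and width; the paper's actual centres, width and basis family are not given here. The helper `joint_features`, which combines the action features with a pre-computed context embedding via an outer product, is likewise a hypothetical interaction design, not DISCO's confirmed one. What the sketch shows is the pooled-learning property: nearby discounts get overlapping feature vectors, so evidence about one level informs its neighbours.

```python
import math
from typing import List, Sequence


def rbf_features(discount: float, centres: Sequence[float], width: float) -> List[float]:
    """Gaussian RBF encoding of a continuous discount level (assumed basis family).

    Feature j is exp(-(discount - centres[j])**2 / (2 * width**2)), so two
    close discount levels produce similar, overlapping feature vectors.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    return [math.exp(-((discount - c) ** 2) / (2.0 * width ** 2)) for c in centres]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def joint_features(discount: float, context: Sequence[float],
                   centres: Sequence[float], width: float) -> List[float]:
    """Hypothetical action-context interaction: the flattened outer product of
    the RBF action features with a pre-computed context embedding."""
    phi = rbf_features(discount, centres, width)
    return [p * x for p in phi for x in context]


# Assumed grid: one centre every 5 percentage points from 0% to 30% off.
CENTRES = [0.0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
WIDTH = 0.05

if __name__ == "__main__":
    f10 = rbf_features(0.10, CENTRES, WIDTH)
    f12 = rbf_features(0.12, CENTRES, WIDTH)
    f25 = rbf_features(0.25, CENTRES, WIDTH)
    # 10% and 12% off share most of their features; 10% and 25% share few.
    print(cosine_similarity(f10, f12), cosine_similarity(f10, f25))
```

Because each feature decays smoothly with distance from its centre, a reward model that is linear in these features has only as many action parameters as there are centres, yet still assigns a prediction to every discount level on the continuum.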

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

This paper introduces DISCO, an end-to-end framework for personalized discount allocation that targets two goals in e-commerce: managing customer relationships and controlling operational costs. Bandit algorithms are well suited to this problem because only partial information (feedback on the chosen discount) is observed and the policy must adapt to a constantly changing business environment. DISCO embeds the traditional Thompson sampling algorithm within an integer program, which allows operational costs to be controlled. To counter the poorer learning efficiency that bandits suffer with high-dimensional actions, DISCO represents the continuous action space with radial basis functions and combines them with context embeddings extracted from a neural network, giving feature representations that are low-dimensional yet expressive. This design lets the model pool learning across neighboring actions while maintaining prediction accuracy, including when extrapolating. In addition, the model preserves the expected negative price elasticity between price and sales (customers buy more when prices are lower), which is a key indicator for evaluating the effectiveness of pricing models. Offline analysis shows that DISCO explores effectively and improves its performance over time despite the global constraint. An online A/B test shows that DISCO achieves a significant improvement of more than 1% in average basket value compared with the legacy systems.

Keywords include personalized discount codes, bandit algorithms, reinforcement learning, retail science, pricing, and Thompson sampling.
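How Thompson sampling and an integer program can work together is sketched below, under several assumptions that go beyond the summary. Each customer receives exactly one discount from a small discrete menu, with index 0 meaning no discount at zero cost; each (customer, discount) pair has an independent Gaussian posterior over expected reward, a simplification of the linear RBF-plus-context reward model described above; and the global constraint is a single budget on total expected cost in non-negative integer units. One posterior draw is taken per pair, and the resulting multiple-choice knapsack is solved exactly by dynamic programming, used here in place of a general-purpose integer-programming solver. All function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def thompson_draws(post_means: Sequence[Sequence[float]],
                   post_stds: Sequence[Sequence[float]],
                   rng: random.Random) -> List[List[float]]:
    """Draw one sampled reward per (customer, discount) pair from an
    assumed independent Gaussian posterior."""
    return [[rng.gauss(m, s) for m, s in zip(means, stds)]
            for means, stds in zip(post_means, post_stds)]


def allocate(rewards: Sequence[Sequence[float]],
             costs: Sequence[Sequence[int]],
             budget: int) -> Tuple[List[int], float]:
    """Pick one discount index per customer to maximise total sampled reward
    subject to sum of non-negative integer costs <= budget (exact DP)."""
    neg_inf = float("-inf")
    # best[b]: best total reward over customers processed so far, spending exactly b.
    best = [neg_inf] * (budget + 1)
    best[0] = 0.0
    picks: List[List[int]] = []  # picks[i][b]: arm chosen for customer i at spend b
    for r_row, c_row in zip(rewards, costs):
        new_best = [neg_inf] * (budget + 1)
        pick = [-1] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == neg_inf:
                continue
            for k, (r, c) in enumerate(zip(r_row, c_row)):
                nb = b + c
                if nb <= budget and best[b] + r > new_best[nb]:
                    new_best[nb] = best[b] + r
                    pick[nb] = k
        best = new_best
        picks.append(pick)
    spend = max(range(budget + 1), key=lambda b: best[b])
    total = best[spend]
    if total == neg_inf:
        raise ValueError("no allocation satisfies the budget")
    # Walk backwards through the stored choices to recover each customer's arm.
    allocation = [0] * len(picks)
    for i in range(len(picks) - 1, -1, -1):
        k = picks[i][spend]
        allocation[i] = k
        spend -= costs[i][k]
    return allocation, total


def thompson_allocate(post_means: Sequence[Sequence[float]],
                      post_stds: Sequence[Sequence[float]],
                      costs: Sequence[Sequence[int]],
                      budget: int,
                      rng: random.Random) -> Tuple[List[int], float]:
    """One round: sample rewards from the posterior, then solve the
    budget-constrained allocation on those samples."""
    return allocate(thompson_draws(post_means, post_stds, rng), costs, budget)


if __name__ == "__main__":
    means = [[1.0, 1.5, 1.6], [1.0, 2.0, 2.2], [1.0, 1.1, 1.2]]
    stds = [[0.1, 0.3, 0.5]] * 3  # wider posteriors for less-explored discounts
    costs = [[0, 1, 2]] * 3       # assumed integer cost units per discount level
    print(thompson_allocate(means, stds, costs, budget=2, rng=random.Random(0)))
```

The budget is what couples customers together: without it, each customer would simply receive the arm with the highest sampled reward. Re-drawing from the posterior every round is what makes the allocation explore, because uncertain discounts occasionally produce high samples and win a share of the budget.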


Doubly High-Dimensional Contextual Bandits: An Interpretable Model for Joint Assortment-Pricing

Contextual Combinatorial Bandit and Its Application on Diversified Online Recommendation

Selectively Contextual Bandits

A Contextual-Bandit Approach to Personalized News Article Recommendation

Improving Portfolio Optimization Results with Bandit Networks

Personalized Product Assortment with Real-time 3D Perception and Bayesian Payoff Estimation

Bandit Learning for Diversified Interactive Recommendation

Conversational Contextual Bandit: Algorithm and Application

Backdoor Adjustment via Group Adaptation for Debiased Coupon Recommendations

Multi-Task Combinatorial Bandits for Budget Allocation

Counterfactual Data Augmentation for Debiased Coupon Recommendations Based on Potential Knowledge

Optimizing Digital Coupon Assignment Using Constrained Reinforcement Learning

Optimizing Item-based Marketing Promotion Efficiency in C2C Marketplace with Dynamic Sequential Coupon Allocation Framework

Online Evaluation of Audiences for Targeted Advertising via Bandit Experiments

Online Personalized Assortment Optimization with High-Dimensional Customer Contextual Data

A Nonparametric Contextual Bandit with Arm-level Eligibility Control for Customer Service Routing

Survey of dynamic pricing based on Multi-Armed Bandit algorithms

Optimising Individual-Treatment-Effect Using Bandits

Hierarchical Conversational Preference Elicitation with Bandit Feedback

Adapting multi-armed bandits policies to contextual bandits scenarios