RetailSynth: Synthetic Data Generation for Retail AI Systems Evaluation

Yu Xia, Ali Arian, Sriram Narayanamoorthy, Joshua Mabry
2023-12-22
Abstract: Significant research effort has been devoted in recent years to developing personalized pricing, promotions, and product recommendation algorithms that can leverage rich customer data to learn and earn. Systematic benchmarking and evaluation of these causal learning systems remains a critical challenge, due to the lack of suitable datasets and simulation environments. In this work, we propose a multi-stage model for simulating customer shopping behavior that captures important sources of heterogeneity, including price sensitivity and past experiences. We embedded this model into a working simulation environment -- RetailSynth. RetailSynth was carefully calibrated on publicly available grocery data to create realistic synthetic shopping transactions. Multiple pricing policies were implemented within the simulator and analyzed for impact on revenue, category penetration, and customer retention. Applied researchers can use RetailSynth to validate causal demand models for multi-category retail and to incorporate realistic price sensitivity into emerging benchmarking suites for personalized pricing, promotions, and product recommendations.
Applications, Artificial Intelligence, Machine Learning, Econometrics
What problem does this paper attempt to address?
The paper addresses the lack of suitable benchmark datasets and simulation environments for evaluating retail AI systems. Existing public datasets are small in scale, biased, and missing key fields, which makes it difficult to reliably evaluate complex systems such as personalized pricing, promotions, and product recommendation algorithms. To solve this problem, the authors propose a multi-stage model of customer shopping behavior and embed it in a working simulation environment called RetailSynth. The synthetic shopping transactions generated by RetailSynth can be used to validate causal demand models, to evaluate the impact of different pricing strategies on revenue, category penetration, and customer retention, and to test the robustness of AI systems.

### Main contributions of the paper include:

1. **Multi-stage model**: A multi-stage decision framework covering the stages of the customer lifecycle: whether to visit the store, which categories to purchase from, which product to buy, and what quantity to purchase.
2. **Synthetic data generation**: An interpretable multi-stage decision model that generates synthetic customer trajectories for a large number of products while remaining computationally efficient.
3. **Price sensitivity modeling**: Heterogeneous price sensitivity across customers and products, which makes the generated data more consistent with real-world shopping behavior.
4. **Calibration and validation**: A detailed description of how the model is calibrated to public grocery data, with a comparison of the choice distributions and overall purchasing behavior in the synthetic and real data.
5. **Scenario analysis**: Simulation of different pricing strategies to show how customer demand changes, and validation of the model's heterogeneous response across customer segments.
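
The four decision stages listed above (store visit, category choice, product choice, quantity) can be illustrated with a small simulation. This is a minimal sketch, not the authors' calibrated model: the logit and Poisson functional forms, the discount lift on store visits, the `Customer` fields, the `simulate_trip` function, and the toy catalog are all assumptions made for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Customer:
    price_sensitivity: float  # assumed: higher value = stronger reaction to price
    visit_intercept: float    # assumed: baseline propensity to visit the store


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def poisson(rng: random.Random, lam: float) -> int:
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_trip(rng, customer, catalog, discount=0.0):
    """Simulate one period for one customer.

    catalog maps category -> list of (product, base_price, appeal).
    Returns a list of (category, product, quantity, unit_price) rows.
    """
    basket = []
    # Stage 1: store visit (binary logit; the discount lift is an assumption).
    visit_utility = customer.visit_intercept + customer.price_sensitivity * discount
    if rng.random() >= sigmoid(visit_utility):
        return basket
    for category, products in catalog.items():
        prices = [base * (1.0 - discount) for _, base, _ in products]
        utils = [appeal - customer.price_sensitivity * price
                 for (_, _, appeal), price in zip(products, prices)]
        # Stage 2: category purchase, driven by the log-sum-exp of product utilities.
        top = max(utils)
        inclusive_value = top + math.log(sum(math.exp(u - top) for u in utils))
        if rng.random() >= sigmoid(inclusive_value):
            continue
        # Stage 3: product choice (multinomial logit over the category's products).
        weights = [math.exp(u - top) for u in utils]
        idx = rng.choices(range(len(products)), weights=weights, k=1)[0]
        # Stage 4: quantity = 1 + Poisson count whose rate shrinks as price rises.
        rate = math.exp(-customer.price_sensitivity * prices[idx])
        basket.append((category, products[idx][0], 1 + poisson(rng, rate), prices[idx]))
    return basket


if __name__ == "__main__":
    toy_catalog = {
        "snacks": [("chips", 2.0, 1.5), ("nuts", 4.0, 2.0)],
        "dairy": [("milk", 1.5, 1.0), ("cheese", 5.0, 2.5)],
    }
    # Heterogeneous shoppers: price sensitivity drawn from a lognormal (assumed).
    pop_rng = random.Random(0)
    shoppers = [Customer(pop_rng.lognormvariate(-0.5, 0.5), pop_rng.gauss(0.0, 1.0))
                for _ in range(1000)]
    # Compare two uniform pricing policies on the same population.
    for discount in (0.0, 0.2):
        rng = random.Random(1)
        revenue = sum(qty * price
                      for shopper in shoppers
                      for _, _, qty, price in simulate_trip(rng, shopper, toy_catalog, discount))
        print(f"discount={discount:.0%} revenue={revenue:.2f}")
```

Because each customer carries its own `price_sensitivity`, the same discount shifts visit, category, product, and quantity decisions by different amounts for different shoppers, which is the kind of heterogeneous response the scenario analysis examines.
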
### Background and motivation of the paper:

With the growth of digital marketing and e-commerce, retailers invest significant resources in AI systems for sales promotions, dynamic pricing, product search and recommendation, and online advertising. Reliable evaluation and benchmarking of these systems remains difficult, mainly because suitable benchmark datasets and simulation environments are lacking. Existing public datasets are usually small in scale, biased, and missing key fields, so they cannot be used to simulate customer behavior comprehensively or accurately. Privacy and competition concerns further limit the sharing of high-quality data. A simulation environment that can generate synthetic data is therefore valuable for accelerating the evaluation and optimization of retail AI systems.