Abstract: We utilize an offline reinforcement learning (RL) model for sequential targeted promotion in the presence of budget constraints in a real-world business environment. In our application, the mobile app aims to boost customer retention by sending cash bonuses to customers and control the costs of such cash bonuses during each time period. To achieve the multi-task goal, we propose the Budget Constrained Reinforcement Learning for Sequential Promotion (BCRLSP) framework to determine the value of cash bonuses to be sent to users. We first find the target policy and the associated Q-values that maximize the user retention rate using an RL model. A linear programming (LP) model is then added to satisfy the constraints of promotion costs. We solve the LP problem by maximizing the Q-values of actions learned from the RL model given the budget constraints. During deployment, we combine the offline RL model with the LP model to generate a robust policy under the budget constraints. Using both online and offline experiments, we demonstrate the efficacy of our approach by showing that BCRLSP achieves a higher long-term customer retention rate and a lower cost than various baselines. Taking advantage of the near real-time cost control method, the proposed framework can easily adapt to data with a noisy behavioral policy and/or meet flexible budget constraints.

What problem does this paper attempt to address?

The paper addresses sequential targeted promotion under budget constraints using an offline reinforcement learning (RL) model, with the goal of improving customer retention while controlling the cost of cash bonuses. Specifically, the research aims to design a strategy for a mobile app that increases the long-term customer retention rate by sending personalized cash bonuses to users, while ensuring that the cost within each time period does not exceed the set budget.

### Main problem description

1. **Customer retention and cost control**: In a real business environment, the mobile app sends cash bonuses to improve the customer retention rate, but it must also control the cost of these bonuses.
2. **Multi-task objectives under budget constraints**: The research not only maximizes the customer retention rate but also ensures that the promotion cost does not exceed the budget limit. This is a multi-task optimization problem that requires both optimality and feasibility.
3. **Application of offline data**: Since online training is costly and risky, the research uses offline data for model training and testing to avoid negative impacts on real users.

### Research methods

To solve the above problems, the authors propose a framework named BCRLSP (Budget Constrained Reinforcement Learning for Sequential Promotion), which combines offline reinforcement learning with linear programming:

- **Offline reinforcement learning (RL) module**: Train the RL model on offline data to determine the target policy that maximizes the user retention rate, together with its corresponding Q-values.
- **Linear programming (LP) module**: Under the given budget constraints, use the linear programming model to choose the actions that maximize the Q-values learned by the RL model while ensuring that the total cost does not exceed the budget.
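
The LP module described above can be illustrated on a toy instance. The sketch below is not the paper's solver: it replaces the LP with an exhaustive search over all assignments, which is exact but only practical for a handful of users. The function name `allocate`, the Q-values, the bonus costs, and the budget are all hypothetical values chosen for illustration.

```python
from itertools import product


def allocate(q, costs, budget):
    """Pick one action per user to maximize the summed Q-value
    subject to total cost <= budget.

    Exhaustive search over every assignment (assumption: tiny instance);
    a real deployment would hand the same problem to an LP solver.
    """
    n_actions = len(costs)
    best_value, best_plan = float("-inf"), None
    for plan in product(range(n_actions), repeat=len(q)):
        total_cost = sum(costs[j] for j in plan)
        if total_cost > budget:
            continue  # infeasible: this plan exceeds the budget
        value = sum(q[i][j] for i, j in enumerate(plan))
        if value > best_value:
            best_value, best_plan = value, plan
    return best_plan, best_value


# Hypothetical Q-values: one row per user, one column per bonus level.
q_values = [
    [0.50, 0.60, 0.90],
    [0.40, 0.70, 0.75],
    [0.80, 0.82, 0.85],
]
bonus_costs = [0, 1, 2]  # hypothetical cost of each bonus level
total_budget = 3         # hypothetical total budget for this period

plan, value = allocate(q_values, bonus_costs, total_budget)

# Ignoring the budget, every user would get the action with the highest Q-value.
unconstrained = [
    max(range(len(bonus_costs)), key=lambda j: row[j]) for row in q_values
]
unconstrained_cost = sum(bonus_costs[j] for j in unconstrained)

print("budgeted plan:", plan, "total Q:", round(value, 2))
print("unconstrained plan:", unconstrained, "cost:", unconstrained_cost)
```

In this toy instance the unconstrained choice gives every user the largest bonus at a total cost of 6, twice the budget. The budgeted plan instead spends the budget where the Q-value gain per unit of cost is largest: the first user gets the largest bonus, the second the middle one, and the third none.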
### Mathematical formula representation

1. **MDP definition**:
   - State \( s_t \) represents the user's characteristics.
   - Action \( a_t \) represents the amount of the cash bonus.
   - Reward \( r_t \) is a binary (0/1) value indicating whether the user logs in the next day.
   - Cost \( c_t \) represents the cost of each action.
2. **Objective function**:
   - Maximize the cumulative user retention rate while satisfying the budget constraint:
     \[ \operatorname*{arg\,max}_{\pi} \ \mathbb{E}\left[\sum_{t=1}^{T} \gamma^{t} r_t \,\middle|\, \pi, s_0\right] \quad \text{s.t.} \quad \mathbb{E}\left[\sum_{t=1}^{T} c_t \,\middle|\, \pi, s_0\right] \leq B \]
     where \( B \) is the budget upper limit and \( \gamma \) is the discount factor.
3. **Linear programming model**:
   - The objective is to maximize the sum of Q-values over all users while satisfying the average cost constraint:
     \[ \operatorname*{arg\,max}_{x} \ \sum_{i=1}^{N} \sum_{j=1}^{M} q_{ij} x_{ij} \]
     \[ \text{s.t.} \quad \sum_{i=1}^{N} \sum_{j=1}^{M} c_j x_{ij} \leq N\bar{c}, \quad \sum_{j=1}^{M} x_{ij} = 1 \ \ \forall i, \quad x_{ij} \in \{0, 1\} \]
     where \( q_{ij} \) is the Q-value of user \( i \) taking action \( j \) in state \( s_i \), \( x_{ij} \) indicates whether action \( j \) is assigned to user \( i \), \( c_j \) is the cost of action \( j \), and \( \bar{c} \) is the average cost set by the company.

### Conclusion

By combining offline reinforcement learning and linear programming, B

BCRLSP: An Offline Reinforcement Learning Framework for Sequential Targeted Promotion
