Trajectory-wise Iterative Reinforcement Learning Framework for Auto-bidding

Haoming Li, Yusen Huo, Shuai Dou, Zhenzhe Zheng, Zhilin Zhang, Chuan Yu, Jian Xu, Fan Wu
2024-04-08
Abstract: In online advertising, advertisers participate in ad auctions to acquire ad opportunities, often by utilizing auto-bidding tools provided by demand-side platforms (DSPs). Current auto-bidding algorithms typically employ reinforcement learning (RL). However, due to safety concerns, most RL-based auto-bidding policies are trained in simulation, leading to performance degradation when deployed in online environments. To narrow this gap, we can deploy multiple auto-bidding agents in parallel to collect a large interaction dataset. Offline RL algorithms can then be utilized to train a new policy. The trained policy can subsequently be deployed for further data collection, resulting in an iterative training framework, which we refer to as iterative offline RL. In this work, we identify the performance bottleneck of this iterative offline RL framework, which originates from the ineffective exploration and exploitation caused by the inherent conservatism of offline RL algorithms. To overcome this bottleneck, we propose Trajectory-wise Exploration and Exploitation (TEE), which introduces a novel data collection and data utilization method for iterative offline RL from a trajectory perspective. Furthermore, to ensure the safety of online exploration while preserving the dataset quality for TEE, we propose Safe Exploration by Adaptive Action Selection (SEAS). Both offline experiments and real-world experiments on the Alibaba display advertising platform demonstrate the effectiveness of our proposed method.
Machine Learning, Artificial Intelligence, Computer Science and Game Theory, Information Retrieval
What problem does this paper attempt to address?
This paper addresses how to improve the performance of auto-bidding algorithms in online advertising through an iterative offline reinforcement learning (RL) framework while keeping online deployment safe. Current RL-based auto-bidding policies are usually trained in simulation, so their performance degrades when they are deployed in real-world environments. To narrow this gap, the authors propose a method that overcomes the performance bottleneck of existing iterative offline RL frameworks.

### Main problems:

1. **Performance bottleneck**: Because offline RL algorithms are inherently conservative, existing iterative offline RL frameworks explore and exploit ineffectively. The collected datasets therefore contain few high-quality trajectories, which limits the performance of the trained policies.
2. **Safety**: Online exploration in a real advertising system requires a data collection policy that is safe and avoids the adverse effects of exploratory behavior.

### Solutions:

The authors propose two key components:

1. **Trajectory-wise Exploration and Exploitation (TEE)**:
   - **Parameter Space Noise (PSN)**: Noise is injected into the policy parameters rather than into the actions, as in traditional Action Space Noise (ASN), so that more high-quality trajectories are generated. The noise is sampled once at the beginning of each episode and held fixed throughout it, which keeps the exploration behavior consistent within the episode.
   - **Robust Trajectory Weighting**: Collected trajectories are weighted so that training pays more attention to high-quality ones, which alleviates the limitations imposed by conservatism. Specifically, predicted expected rewards, rather than actual rewards, are used to compute a quality indicator for each trajectory, and the sampling probabilities are adjusted according to these indicators.
2. **Safe Exploration by Adaptive Action Selection (SEAS)**:
   - SEAS keeps each exploration within a safe range by adaptively choosing between a safe action and an exploratory action, while retaining as much high-quality exploratory behavior as possible. It decides dynamically whether to take the exploratory action based on the cumulative reward and the predicted future return, aiming to maximize dataset quality while ensuring safety.

### Summary:

The main contributions of this paper are:

- Identifying and demonstrating the performance bottleneck of current iterative offline RL frameworks in auto-bidding, which is mainly caused by ineffective exploration and exploitation due to the conservatism principle.
- Proposing the TEE framework, including PSN and the robust trajectory weighting algorithm, which improves the efficiency of exploration and exploitation in iterative offline RL.
- Designing the SEAS algorithm to ensure the safety of online exploration while minimizing its impact on policy learning performance.
- Showing experimentally that the proposed method outperforms existing methods both in the simulated environment and on Alibaba's display advertising platform.

Together, these improvements provide a more efficient and safer solution for auto-bidding in online advertising.
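
The three mechanisms summarized above (PSN versus ASN, trajectory weighting, and SEAS-style action selection) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The linear toy policy, the Gaussian noise, the softmax weighting with a `temperature` parameter, and the threshold rule in `seas_select` are all assumptions made here for illustration, since the summary does not give the paper's exact formulas.

```python
import math
import random


def linear_policy(theta, state):
    """Toy deterministic policy (hypothetical): bid = dot(theta, state)."""
    return sum(t * s for t, s in zip(theta, state))


def rollout_asn(theta, states, sigma, rng):
    """Action Space Noise: independent noise is added to every action."""
    return [linear_policy(theta, s) + rng.gauss(0.0, sigma) for s in states]


def rollout_psn(theta, states, sigma, rng):
    """Parameter Space Noise: perturb the parameters once at the start of
    the episode, then act deterministically with the perturbed policy."""
    noisy_theta = [t + rng.gauss(0.0, sigma) for t in theta]
    return [linear_policy(noisy_theta, s) for s in states]


def trajectory_sampling_probs(predicted_returns, temperature=1.0):
    """Assumed weighting scheme: a softmax over *predicted* trajectory
    returns, so trajectories with higher predicted quality are sampled
    more often during training. Requires temperature > 0."""
    top = max(predicted_returns)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / temperature) for r in predicted_returns]
    total = sum(exps)
    return [e / total for e in exps]


def seas_select(explore_action, safe_action, cumulative_reward,
                predicted_future_return, safety_threshold):
    """Assumed decision rule: take the exploratory action only if the
    reward collected so far plus the predicted future return still
    clears a safety threshold; otherwise fall back to the safe action."""
    if cumulative_reward + predicted_future_return >= safety_threshold:
        return explore_action
    return safe_action
```

With identical states across an episode, `rollout_psn` returns the same action at every step because the perturbation is fixed for the whole episode, whereas `rollout_asn` returns a different action at each step. This is the within-episode consistency that the summary attributes to PSN.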