Boosted Off-Policy Learning

Ben London, Levi Lu, Ted Sandler, Thorsten Joachims

DOI: https://doi.org/10.48550/arXiv.2208.01148

2023-05-03

Abstract: We propose the first boosting algorithm for off-policy learning from logged bandit feedback. Unlike existing boosting methods for supervised learning, our algorithm directly optimizes an estimate of the policy's expected reward. We analyze this algorithm and prove that the excess empirical risk decreases (possibly exponentially fast) with each round of boosting, provided a "weak" learning condition is satisfied by the base learner. We further show how to reduce the base learner to supervised learning, which opens up a broad range of readily available base learners with practical benefits, such as decision trees. Experiments indicate that our algorithm inherits many desirable properties of tree-based boosting algorithms (e.g., robustness to feature scaling and hyperparameter tuning), and that it can outperform off-policy learning with deep neural networks as well as methods that simply regress on the observed rewards.

Machine Learning, Information Retrieval

What problem does this paper attempt to address?

The core problem this paper addresses is how to apply boosting to off-policy learning from logged bandit feedback so that the policy's expected reward is optimized directly. To that end, the authors propose a boosting algorithm designed for this setting: **Boosted Off-Policy Learning (BOPL)**.

### Main problems and challenges

1. **Limitations of existing methods**:
   - Existing boosting methods mainly target supervised learning, whereas off-policy learning is a contextual bandit problem in which the feedback depends on the actions taken.
   - Reward regression can predict rewards but does not necessarily produce a better policy, because minimizing squared error is not equivalent to maximizing expected reward.
2. **Directly optimizing the expected reward**:
   - BOPL directly optimizes an estimate of the policy's expected reward, rather than deriving the policy indirectly through regression.
3. **Theoretical guarantees**:
   - The paper proves that, under a "weak learning" condition on the base learner, the excess empirical risk decreases (possibly exponentially fast) with each round of boosting.
   - The specific form of the boosting algorithm is derived from an upper bound on the smooth loss function.
4. **Advantages in practical applications**:
   - In experiments on multiple public datasets, BOPL outperforms off-policy learning with deep neural networks as well as simple reward regression.
   - Boosted ensemble policies are robust, easy to tune, and fast to train, which makes them well suited to practical applications.

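### Formula representation

- **Risk definition**:
  \[ L(\pi) = \mathbb{E}_{x \sim D_x}\, \mathbb{E}_{a \sim \pi(x)}\, \mathbb{E}_{r \sim D_r}\left[-r(x, a)\right] \]
- **Empirical risk estimation**:
  \[ \hat{L}(\pi, S) = \frac{1}{n} \sum_{i=1}^{n} -\frac{r_i\, \pi(a_i \mid x_i)}{p_i} \]
- **Optimal weights**:
  \[ \alpha_t^{\star} = \frac{2}{n} \cdot \frac{\sum_{i=1}^{n} \frac{r_i}{p_i}\, \pi_{t-1}(a_i \mid x_i)\, \left(a_i - \pi_{t-1}(x_i)\right)^{\top} f_t(x_i)}{\sum_{i=1}^{n} \frac{|r_i|}{p_i}\, \|f_t(x_i)\|^2} \]

Here \(S\) is the logged dataset of \(n\) tuples \((x_i, a_i, r_i, p_i)\), where \(p_i\) is the probability with which the logging policy chose action \(a_i\); \(\pi_{t-1}\) is the ensemble policy after \(t-1\) rounds, \(f_t\) is the base learner's output in round \(t\), and \(a_i\) is treated as a one-hot vector in the numerator of \(\alpha_t^{\star}\).

Through these formulas and the accompanying theoretical analysis, the paper shows how boosting can achieve efficient and effective policy optimization in off-policy learning.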
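
To make the empirical risk estimate concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a softmax policy over per-action ensemble scores (an assumption; this summary does not state the policy class) and uses hypothetical function names (`softmax`, `ips_risk`, `negative_gradient`). Besides evaluating \(\hat{L}(\pi, S)\), it computes the negative gradient of that estimate with respect to the scores, which under the softmax assumption has the same per-example form as the term multiplying \(f_t(x_i)\) in the numerator of \(\alpha_t^{\star}\).

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() stable."""
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def ips_risk(scores, actions, rewards, propensities):
    """Empirical risk (1/n) * sum_i -r_i * pi(a_i | x_i) / p_i.

    scores:       (n, k) ensemble scores, one column per action (assumed)
    actions:      (n,) integer indices of the logged actions a_i
    rewards:      (n,) logged rewards r_i
    propensities: (n,) logging-policy probabilities p_i
    """
    pi = softmax(scores)
    pi_logged = pi[np.arange(len(actions)), actions]
    return float(np.mean(-rewards * pi_logged / propensities))


def negative_gradient(scores, actions, rewards, propensities):
    """Negative gradient of ips_risk with respect to the scores.

    Row i equals (1/n) * (r_i / p_i) * pi(a_i | x_i) * (onehot(a_i) - pi(x_i)).
    """
    pi = softmax(scores)
    n, k = pi.shape
    one_hot = np.eye(k)[actions]
    weight = rewards / propensities * pi[np.arange(n), actions]
    return weight[:, None] * (one_hot - pi) / n


# Tiny illustrative log: two contexts, two actions, a uniform initial policy.
scores = np.zeros((2, 2))
actions = np.array([0, 1])
rewards = np.array([1.0, 0.0])
propensities = np.array([0.5, 0.5])
print(ips_risk(scores, actions, rewards, propensities))  # -0.5
```

In a boosting round, a base learner could be fit to the rows of `negative_gradient` as regression targets before its weight is chosen; this pairing is an illustrative reading of the formulas above rather than a description of the paper's exact procedure.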

Behavior Proximal Policy Optimization

Off-Policy Evaluation and Learning from Logged Bandit Feedback: Error Reduction via Surrogate Policy.

Boosting Nonparametric Policies.

Online Boosting with Bandit Feedback

Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL

Exponential Smoothing for Off-Policy Learning

Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline

Learning from eXtreme Bandit Feedback

Off-Policy Policy Gradient Algorithms by Constraining the State Distribution Shift

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Robust Offline Policy Learning with Observational Data from Multiple Sources

POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy Decomposition

Anytime-valid off-policy inference for contextual bandits

Off-Policy Prediction Learning: An Empirical Study of Online Algorithms

Combining Online Learning and Offline Learning for Contextual Bandits with Deficient Support

Pessimistic Off-Policy Optimization for Learning to Rank

CAB: Continuous Adaptive Blending Estimator for Policy Evaluation and Learning

Off-Policy Policy Gradient with State Distribution Correction

Off-Policy Evaluation for Large Action Spaces via Policy Convolution

Minimax Adaptive Boosting for Online Nonparametric Regression