Abstract: We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The sIS estimator is obtained by performing importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization). We employ this estimator in a novel algorithmic procedure, named Policy Optimization for eXtreme Models (POXM), for learning from bandit feedback on XMC tasks. In POXM, the selected actions for the sIS estimator are the top-p actions of the logging policy, where p is adjusted from the data and is significantly smaller than the size of the action space. We use a supervised-to-bandit conversion on three XMC datasets to benchmark our POXM method against three competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline. Whereas BanditNet sometimes improves marginally over the logging policy, our experiments show that POXM systematically and significantly improves over all baselines.
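The contrast between plain importance sampling and a selective variant can be illustrated with a small NumPy sketch. This is one possible reading of the abstract, not the paper's exact estimator: the function names, the dense `(n, K)` probability arrays, the renormalization of both policies over the selected set, and the averaging over in-set samples are all assumptions made for illustration.

```python
import numpy as np

def is_estimate(rewards, pi_logged, rho_logged):
    """Vanilla importance sampling: mean of reward * pi(a|x) / rho(a|x)."""
    return float(np.mean(rewards * pi_logged / rho_logged))

def selective_is_estimate(rewards, actions, pi, rho, p):
    """Hypothetical selective IS sketch (assumed form, not the paper's definition).

    pi, rho : (n, K) arrays of target / logging action probabilities.
    actions : (n,) logged action indices; rewards : (n,) observed rewards.
    p       : size of the selected set, taken as the top-p logging actions.
    """
    n, _ = rho.shape
    rows = np.arange(n)
    # Indices of the p most probable actions under the logging policy.
    top = np.argpartition(-rho, p - 1, axis=1)[:, :p]
    # Whether each logged action falls inside its instance's selected set.
    in_top = (top == actions[:, None]).any(axis=1)
    # Probability mass each policy places on the selected set.
    pi_mass = np.take_along_axis(pi, top, axis=1).sum(axis=1)
    rho_mass = np.take_along_axis(rho, top, axis=1).sum(axis=1)
    # Policies conditioned on the selected set (assumes nonzero masses).
    pi_cond = pi[rows, actions] / pi_mass
    rho_cond = rho[rows, actions] / rho_mass
    weights = np.where(in_top, pi_cond / rho_cond, 0.0)
    # Average only over samples whose logged action is in the selected set.
    return float(np.sum(weights * rewards) / max(int(in_top.sum()), 1))

# Toy data: 2 instances, 3 actions, selected set of size 2.
rho = np.array([[0.5, 0.3, 0.2], [0.6, 0.3, 0.1]])
pi = np.array([[0.2, 0.6, 0.2], [0.1, 0.1, 0.8]])
actions = np.array([1, 2])
rewards = np.array([1.0, 1.0])
rows = np.arange(2)

v_is = is_estimate(rewards, pi[rows, actions], rho[rows, actions])
v_sis = selective_is_estimate(rewards, actions, pi, rho, p=2)
```

In this toy example the second logged action has a small logging probability and lies outside the top-2 set, so the selective variant drops its large importance weight. That is the intuition behind trading some bias for lower variance, under the assumptions stated above.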

Sample-efficient Nonstationary Policy Evaluation for Contextual Bandits

Anytime-valid off-policy inference for contextual bandits

Combining Online Learning and Offline Learning for Contextual Bandits with Deficient Support

Distributionally Robust Batch Contextual Bandits

Learning Contextual Bandits in a Non-stationary Environment

Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits

Exploring Offline Policy Evaluation for the Continuous-Armed Bandit Problem

Distributionally Robust Policy Evaluation under General Covariate Shift in Contextual Bandits

Safe Exploration for Efficient Policy Evaluation and Comparison

Selectively Contextual Bandits

Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning

CAB: Continuous Adaptive Blending Estimator for Policy Evaluation and Learning

Deep Contextual Multi-armed Bandits

Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms

Imitation-Regularized Offline Learning

Inverse Contextual Bandits: Learning How Behavior Evolves over Time

Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits

Contextual Bandit with Adaptive Feature Extraction

When is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective

Optimal Baseline Corrections for Off-Policy Contextual Bandits

Learning from eXtreme Bandit Feedback