Abstract: We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The sIS estimator is obtained by performing importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization). We employ this estimator in a novel algorithmic procedure -- named Policy Optimization for eXtreme Models (POXM) -- for learning from bandit feedback on XMC tasks. In POXM, the selected actions for the sIS estimator are the top-p actions of the logging policy, where p is adjusted from the data and is significantly smaller than the size of the action space. We use a supervised-to-bandit conversion on three XMC datasets to benchmark our POXM method against three competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline. Whereas BanditNet sometimes improves marginally over the logging policy, our experiments show that POXM systematically and significantly improves over all baselines.

What problem does this paper attempt to address?

This paper addresses batch learning from bandit feedback when the action space is extremely large. Specifically, it focuses on how to use logged observational data to improve a model in applications such as recommendation systems, where billions of decisions are made over sets of millions of choices every day.

### Problem Background

1. **Differences between Bandit Feedback and Supervised Learning**
   - In traditional supervised learning, each data point comes with a label, so the model can evaluate the loss of the action it selected as well as the loss of the actions it did not select.
   - In bandit feedback, the training data only records the reward for the action that was selected and does not reveal which action would have been correct. This form of feedback is common in large-scale applications such as recommendation systems and online markets.
2. **Extremely Large-Scale Action Space**
   - Real-world recommendation systems make billions of decisions per day over sets of millions of choices, so the action space is extremely large.
   - Supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used in this setting, but they incur significant bias because of the mismatch between bandit feedback and supervised labels.
3. **Problems with Importance Sampling**
   - Importance sampling techniques can mitigate this bias. However, when the number of actions is large, the variance of importance sampling becomes impractically high, which makes the estimates unstable.

### Main Contributions of the Paper

1. **Selective Importance Sampling (sIS)**
   - The paper introduces a selective importance sampling estimator (sIS) that operates in a more favorable bias-variance regime than standard importance sampling.
   - Specifically, the sIS estimator performs importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization).
2. **Policy Optimization for eXtreme Models (POXM)**
   - Based on the sIS estimator, the paper proposes a new algorithm, POXM, for learning from bandit feedback on XMC tasks.
   - In POXM, the selected actions are the top \( p \) actions of the logging policy, where \( p \) is adjusted from the data and is much smaller than the size of the action space.

### Experimental Verification

- The paper applies a supervised-to-bandit conversion to three XMC datasets (EUR-Lex, Wiki10-31K, Amazon-670K) and compares POXM with three competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline.
- BanditNet sometimes improves marginally over the logging policy, whereas POXM systematically and significantly improves over all baselines.

### Conclusion

By introducing selective importance sampling and the POXM algorithm, this paper offers a practical approach to batch learning from bandit feedback in extremely large action spaces, with direct relevance to large-scale applications such as recommendation systems.
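To make the contrast between standard and selective importance sampling concrete, the following is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the paper's exact estimator: here the importance weight is simply kept for logged actions that fall in the logging policy's top-p set and set to zero otherwise, and the paper's formulation (for example, how probabilities are conditioned on the subset) may differ. The function names, the dense `(n, K)` probability arrays, and the fixed `p` are all hypothetical choices made for this example.

```python
import numpy as np


def top_p_mask(logging_probs, p):
    """Boolean (n, K) mask marking the p most probable actions in each row."""
    n, k = logging_probs.shape
    # Indices of the p largest logging probabilities per row (order within the p is arbitrary).
    top_idx = np.argpartition(-logging_probs, p - 1, axis=1)[:, :p]
    mask = np.zeros((n, k), dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=1)
    return mask


def is_estimate(rewards, actions, target_probs, logging_probs):
    """Standard importance sampling estimate of the target policy's value."""
    rows = np.arange(len(actions))
    weights = target_probs[rows, actions] / logging_probs[rows, actions]
    return float(np.mean(weights * rewards))


def selective_is_estimate(rewards, actions, target_probs, logging_probs, p):
    """Assumed simplification of sIS: weights are kept only for logged actions
    inside the logging policy's top-p set and zeroed elsewhere."""
    rows = np.arange(len(actions))
    selected = top_p_mask(logging_probs, p)[rows, actions]
    weights = target_probs[rows, actions] / logging_probs[rows, actions]
    return float(np.mean(np.where(selected, weights, 0.0) * rewards))
```

In this sketch, a logged action with a very small logging probability (and therefore a very large weight) contributes nothing once it falls outside the top-p set, which is the intuition behind trading some bias for lower variance. Setting `p` equal to the number of actions recovers the standard estimate.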

Learning from eXtreme Bandit Feedback
