Abstract: This paper investigates regret minimization, statistical inference, and their interplay in high-dimensional online decision-making based on the sparse linear context bandit model. We integrate the $\varepsilon$-greedy bandit algorithm for decision-making with a hard thresholding algorithm for estimating sparse bandit parameters and introduce an inference framework based on a debiasing method using inverse propensity weighting. Under a margin condition, our method achieves either $O(T^{1/2})$ regret or classical $O(T^{1/2})$-consistent inference, indicating an unavoidable trade-off between exploration and exploitation. If a diverse covariate condition holds, we demonstrate that a pure-greedy bandit algorithm, i.e., exploration-free, combined with a debiased estimator based on average weighting can simultaneously achieve optimal $O(\log T)$ regret and $O(T^{1/2})$-consistent inference. We also show that a simple sample mean estimator can provide valid inference for the optimal policy's value. Numerical simulations and experiments on Warfarin dosing data validate the effectiveness of our methods.

What problem does this paper attempt to address?

This paper attempts to solve the problem of how to simultaneously achieve regret minimization and statistical inference in high-dimensional online decision-making. Specifically, the research focuses on the Sparse Linear Contextual Bandit Model (LCB) and explores the interaction between regret minimization and statistical inference of parameter estimation in the decision-making process.

### Main problems of the paper

1. **Regret minimization**: In a high-dimensional data environment, how to design algorithms to minimize regret (i.e., the loss compared to the optimal strategy) in the decision-making process. Regret is usually defined as the gap between the cumulative rewards of the algorithm and the cumulative rewards of the optimal strategy.
2. **Statistical inference**: How to perform effective statistical inference on the bandit parameters in a high-dimensional data environment, including constructing confidence intervals and hypothesis testing. This involves accurately characterizing the bias and variance of parameter estimates.
3. **Trade-off between exploration and exploitation**: In the decision-making process, how to balance exploring new information against exploiting existing information to simultaneously achieve good regret performance and statistical inference efficiency.

### Main contributions

1. **General inference framework and regret trade-off**:
   - Proposes a new statistical inference framework, combining the ε-greedy bandit algorithm and the Hard Thresholding (HT) method, for handling adaptively collected high-dimensional data.
   - Introduces an online de-biasing technique based on Inverse Propensity Weighting (IPW) to reduce the bias introduced by adaptive data collection and implicit regularization.
   - Analyzes the trade-off relationship between regret performance and the asymptotic variance of estimators, and points out that in some cases, the regret upper bound is $O(T^{1-\gamma})$ and the asymptotic variance of the estimator is $O(T^{-\gamma})$.
2. **Simultaneously achieving optimal regret and inference**:
   - Under the assumption of Covariate Diversity (CD), shows that the pure-greedy algorithm (i.e., without exploration) combined with the average-weighted de-biasing estimator can simultaneously achieve the optimal $O(\log T)$ regret upper bound and $O(T^{-1/2})$ uniform inference.
   - Proposes an inference method for the optimal policy value (Q-value) to evaluate the maximum total reward of the optimal policy.
3. **Empirical results**:
   - Verifies the effectiveness of the algorithm and inference framework through numerical simulations and real-data experiments, especially in the application of warfarin dose adjustment, and identifies several key variables that significantly affect the dose.

### Application examples

- **Warfarin dose adjustment**: By collecting high-dimensional data of patients (such as gender, height, weight, ethnicity, drug interactions, medical history, biomarkers, etc.), optimize the dose for each patient and improve the treatment effect.
- **Marketing strategy**: By analyzing customer characteristics (such as demographic information, purchase history, past activity responses, etc.), determine the customer groups most likely to respond to marketing activities and improve resource utilization efficiency.
- **Air ticket cancellation**: By analyzing high-dimensional data of customers (such as gender, age, travel purpose, ticket price, discount, origin and destination, etc.), evaluate the cancellation risk and formulate effective risk management strategies.
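
To make the combination of ε-greedy decisions and hard thresholding described under the main contributions more concrete, here is a minimal Python sketch. It is not the paper's algorithm: it assumes a two-armed problem with Gaussian contexts, an exploration schedule of `eps = t ** (-gamma)`, and it replaces the online hard-thresholding update with a simpler stand-in (an IPW-weighted ridge solve followed by keeping the `s` largest coefficients). All function names and parameters (`hard_threshold`, `epsilon_greedy_action`, `run_bandit`, `ridge`, `gamma`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and set the rest to zero."""
    out = np.zeros_like(v)
    if s > 0:
        keep = np.argsort(np.abs(v))[-s:]
        out[keep] = v[keep]
    return out


def epsilon_greedy_action(x, betas, eps, rng):
    """Pick an arm epsilon-greedily; return the arm and its selection probability."""
    n_arms = len(betas)
    greedy = int(np.argmax([x @ b for b in betas]))
    probs = np.full(n_arms, eps / n_arms)  # exploration mass spread uniformly
    probs[greedy] += 1.0 - eps             # remaining mass on the greedy arm
    arm = int(rng.choice(n_arms, p=probs))
    return arm, float(probs[arm])


def run_bandit(T=300, d=20, s=3, gamma=0.5, ridge=1.0, seed=0):
    """Simulate a toy two-armed sparse linear bandit; return (regret, estimates)."""
    rng = np.random.default_rng(seed)
    # Assumed ground truth: each arm has an s-sparse parameter vector.
    true_betas = [np.zeros(d), np.zeros(d)]
    true_betas[0][:s] = 1.0
    true_betas[1][:s] = -1.0
    est = [np.zeros(d) for _ in range(2)]
    A = [ridge * np.eye(d) for _ in range(2)]  # weighted Gram matrices
    b = [np.zeros(d) for _ in range(2)]        # weighted response vectors
    regret = 0.0
    for t in range(1, T + 1):
        x = rng.standard_normal(d)
        eps = min(1.0, t ** (-gamma))          # assumed decaying exploration rate
        arm, prop = epsilon_greedy_action(x, est, eps, rng)
        means = [float(x @ beta) for beta in true_betas]
        reward = means[arm] + 0.1 * rng.standard_normal()
        regret += max(means) - means[arm]      # gap to the best arm this round
        # Reweight the observation by 1 / propensity, then re-estimate and
        # keep only the s largest coefficients (stand-in for online HT).
        A[arm] += np.outer(x, x) / prop
        b[arm] += x * reward / prop
        est[arm] = hard_threshold(np.linalg.solve(A[arm], b[arm]), s)
    return regret, est
```

Calling `run_bandit(gamma=0.2)` and `run_bandit(gamma=0.8)` gives a rough feel for how the amount of exploration changes the accumulated regret in this toy setting; the numbers say nothing about the paper's theoretical rates.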
### Related work

- **High-dimensional linear regression and statistical inference**: Discusses the de-biasing LASSO estimator in high-dimensional linear regression and its application in confidence intervals and hypothesis testing.
- **Contextual multi-armed bandit and statistical inference**: Reviews the statistical inference methods for low-dimensional or stochastic contextual multi-armed bandits in the existing literature.
- **Variance stabilization in adaptive experiments**: Introduces variance stabilization techniques in adaptive data collection, especially the Inverse Propensity Weighting method and its improvement strategies.

In conclusion, this paper solves the problems of regret minimization and statistical inference in high-dimensional online decision-making by proposing new algorithms and inference frameworks, and verifies their effectiveness in practical applications.
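
The inverse propensity weighting idea mentioned above can be illustrated on a much simpler target than the paper's bandit parameters: the mean reward of a single arm when the probability of pulling that arm depends on the context. The sketch below is only an illustration of the weighting principle under assumed toy data (binary context, reward equal to the context, pull probability 0.9 or 0.1); `ipw_mean`, `naive_mean`, and `demo` are hypothetical names, and this is not the debiased estimator from the paper.

```python
import numpy as np


def ipw_mean(rewards, chosen, propensities):
    """Inverse-propensity-weighted estimate of one arm's mean reward.

    rewards[t]      reward the arm yields at round t (used only where chosen[t] == 1)
    chosen[t]       1 if the arm was pulled at round t, else 0
    propensities[t] probability with which the arm was pulled at round t
    """
    rewards = np.asarray(rewards, dtype=float)
    chosen = np.asarray(chosen, dtype=float)
    propensities = np.asarray(propensities, dtype=float)
    return float(np.mean(chosen * rewards / propensities))


def naive_mean(rewards, chosen):
    """Plain average over the rounds in which the arm was pulled."""
    rewards = np.asarray(rewards, dtype=float)
    chosen = np.asarray(chosen, dtype=bool)
    return float(rewards[chosen].mean())


def demo(n=20000, seed=0):
    """Toy adaptive collection: the arm is pulled more often when its reward is high."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=n)        # binary context, each value equally likely
    rewards = x.astype(float)             # assumed reward of the arm: equal to x
    prop = np.where(x == 1, 0.9, 0.1)     # context-dependent pull probability
    chosen = rng.random(n) < prop
    return naive_mean(rewards, chosen), ipw_mean(rewards, chosen, prop)
```

In this toy setup the true mean reward is 0.5. The naive average over pulled rounds is skewed toward rounds with high reward, while dividing by the propensity corrects for the uneven sampling, at the cost of larger variance when propensities are small.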

Regret Minimization and Statistical Inference in Online Decision Making with High-dimensional Covariates

Regret Minimization via Saddle Point Optimization

Statistical Inference for Online Decision-Making: In a Contextual Bandit Setting

Dimension Reduction in Contextual Online Learning Via Nonparametric Variable Selection

Information Directed Sampling for Sparse Linear Bandits

Optimal Regret Is Achievable with Bounded Approximate Inference Error: An Enhanced Bayesian Upper Confidence Bound Framework

Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed Payoffs

Multi-Armed Bandits with Network Interference

Efficient Constrained Regret Minimization

Adaptive Regret for Bandits Made Possible: Two Queries Suffice

Model-Assisted Uniformly Honest Inference for Optimal Treatment Regimes in High Dimension

Variance-Aware Sparse Linear Bandits

Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits

Proportional Response: Contextual Bandits for Simple and Cumulative Regret Minimization

Online Policy Learning and Inference by Matrix Completion

Regret Analysis of Bandit Problems with Causal Background Knowledge

Dynamic Selection in Algorithmic Decision-making

High-dimensional Linear Bandits with Knapsacks

Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits

Self-fulfilling Bandits: Endogeneity Spillover and Dynamic Selection in Algorithmic Decision-making

Bayesian Regret Minimization in Offline Bandits