Abstract: We study the repeated principal-agent bandit game, in which the principal interacts with an unknown environment only indirectly, by offering incentives for the agent to play arms. Most existing work assumes that the agent knows the reward means exactly and always behaves greedily, but in many online marketplaces the agent must learn the unknown environment and sometimes explores. Motivated by such settings, we model a self-interested learning agent with exploratory behavior, who iteratively updates reward estimates and either selects the arm that maximizes the estimated reward plus incentive or, with a certain probability, explores arbitrarily. As a warm-up, we first consider a self-interested learning agent without exploration. We propose algorithms for both the i.i.d. and linear reward settings with bandit feedback over a finite horizon $T$, achieving regret bounds of $\widetilde{O}(\sqrt{T})$ and $\widetilde{O}(T^{2/3})$, respectively. These algorithms build on a novel elimination framework coupled with newly developed search algorithms that accommodate the uncertainty arising from the agent's learning behavior. We then extend the framework to handle the exploratory learning agent and develop an algorithm that achieves an $\widetilde{O}(T^{2/3})$ regret bound in the i.i.d. reward setting by making the elimination framework robust to potential agent exploration. Finally, when the agent's behavior is reduced to the one studied in (Dogan et al., 2023a), we propose an algorithm based on our robust framework that achieves an $\widetilde{O}(\sqrt{T})$ regret bound, significantly improving upon their $\widetilde{O}(T^{11/12})$ bound.
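
The agent model described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm or notation: the class name `LearningAgent`, the parameter `explore_prob`, the use of running empirical means as reward estimates, and uniform random exploration (standing in for "explores arbitrarily") are all assumptions made here for illustration.

```python
import random


class LearningAgent:
    """Hypothetical self-interested learning agent (illustrative only)."""

    def __init__(self, n_arms, explore_prob=0.0, seed=0):
        self.counts = [0] * n_arms        # number of pulls per arm
        self.means = [0.0] * n_arms       # empirical reward estimates
        self.explore_prob = explore_prob  # assumed exploration probability
        self.rng = random.Random(seed)

    def choose(self, incentives):
        """Pick an arm given the principal's per-arm incentives."""
        # Assumption: exploration is modeled as a uniformly random arm.
        if self.rng.random() < self.explore_prob:
            return self.rng.randrange(len(self.means))
        # Otherwise maximize estimated reward plus offered incentive.
        scores = [m + b for m, b in zip(self.means, incentives)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        """Incrementally update the empirical mean of the played arm."""
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

With `explore_prob=0.0` the sketch reduces to the warm-up case of a learning agent without exploration: the principal can steer the agent to a different arm only by offering an incentive larger than the gap in the agent's current estimates.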

Neural Active Learning Beyond Bandits

Neural Active Learning with Performance Guarantees

Empirical analysis of representation learning and exploration in neural kernel bandits

Streaming Active Learning with Deep Neural Networks

Active Learning for Streaming Networked Data

Effective Active Learning Method for Spiking Neural Networks

Active Multi-task Learning Via Bandits

Meta-Learning Adversarial Bandit Algorithms

Deep Individual Active Learning: Safeguarding Against Out-of-Distribution Challenges in Neural Networks

Random Walk Bandits

Neural Active Learning on Heteroskedastic Distributions

Neural Exploitation and Exploration of Contextual Bandits

Bandit Online Learning on Graphs Via Adaptive Optimization

Meta-Learning Transferable Active Learning Policies by Deep Reinforcement Learning

Principal-Agent Bandit Games with Self-Interested and Exploratory Learning Agents

Learning Generative State Space Models for Active Inference

Neural Active Learning Meets the Partial Monitoring Framework

Robust online active learning

Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits

Social Bandit Learning: Strangers Can Help

AnchorAL: Computationally Efficient Active Learning for Large and Imbalanced Datasets