Junyan Liu, Lillian J. Ratliff
Abstract: We study the repeated principal-agent bandit game, where the principal indirectly interacts with the unknown environment by proposing incentives for the agent to play arms. Most existing work assumes the agent has full knowledge of the reward means and always behaves greedily, but in many online marketplaces, the agent needs to learn the unknown environment and sometimes explore. Motivated by such settings, we model a self-interested learning agent with exploration behaviors who iteratively updates reward estimates and either selects an arm that maximizes the estimated reward plus incentive or explores arbitrarily with a certain probability. As a warm-up, we first consider a self-interested learning agent without exploration. We propose algorithms for both i.i.d. and linear reward settings with bandit feedback in a finite horizon $T$, achieving regret bounds of $\widetilde{O}(\sqrt{T})$ and $\widetilde{O}(T^{2/3})$, respectively. Specifically, these algorithms are built on a novel elimination framework coupled with newly developed search algorithms that accommodate the uncertainty arising from the learning behavior of the agent. We then extend the framework to handle the exploratory learning agent and develop an algorithm that achieves an $\widetilde{O}(T^{2/3})$ regret bound in the i.i.d. reward setup by enhancing the robustness of our elimination framework to potential agent exploration. Finally, when reducing our agent behaviors to the one studied in (Dogan et al., 2023a), we propose an algorithm based on our robust framework, which achieves an $\widetilde{O}(\sqrt{T})$ regret bound, significantly improving upon their $\widetilde{O}(T^{11/12})$ bound.
What problem does this paper attempt to address?
This paper addresses principal-agent bandit games in which the agent is a self-interested learner that may also explore. Existing research usually assumes that the agent knows the expected reward of every arm and always greedily selects the arm that maximizes expected reward plus incentive. In many practical scenarios, such as online marketplaces, the agent instead has to learn the unknown environment and sometimes explores. The paper therefore aims to:
1. **Extend the agent's learning behavior**: Consider a self-interested learning agent that may explore non-maximizing arms during learning. When it does not explore, it selects the arm that maximizes its empirical mean reward plus the incentive, rather than the true expected reward plus the incentive.
2. **Design an improved algorithm**: For this more realistic agent model, design new algorithms that improve on the \(\tilde{O}(T^{11/12})\) regret bound of prior work (Dogan et al., 2023a).
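The agent model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `agent_choose_arm`, `update_mean`, and `explore_prob` are hypothetical, and uniform random exploration stands in for the "arbitrary" exploration mentioned in the abstract.

```python
import random


def agent_choose_arm(empirical_means, incentives, explore_prob, rng):
    """Return the arm played by a self-interested learning agent.

    With probability explore_prob the agent explores; here exploration is
    uniform over arms, which is an assumption (the source only says the
    agent explores arbitrarily). Otherwise it greedily maximizes its own
    empirical mean plus the incentive offered by the principal.
    """
    num_arms = len(empirical_means)
    if rng.random() < explore_prob:
        return rng.randrange(num_arms)
    scores = [m + b for m, b in zip(empirical_means, incentives)]
    return max(range(num_arms), key=lambda a: scores[a])


def update_mean(old_mean, count, reward):
    """Incrementally update an empirical mean after one more observation."""
    return old_mean + (reward - old_mean) / (count + 1)


rng = random.Random(0)
# Without incentives the agent prefers arm 1; an incentive of 0.3 on arm 2
# makes arm 2 the maximizer of estimate plus incentive.
chosen = agent_choose_arm([0.2, 0.5, 0.4], [0.0, 0.0, 0.3], 0.0, rng)
print(chosen)
```

Because the agent's estimates change after every pull, the incentive needed to induce a given arm also changes over time, which is what separates this model from the fully informed greedy agent.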
### Main contributions
1. **I.I.D. reward setting**:
- Propose a new algorithm (Algorithm 1) for self-interested learning agents without exploration behavior, achieving an expected regret bound of \(O(\sqrt{KT\log(KT)})\).
- Propose a further algorithm (Algorithm 5) for self-interested learning agents with exploration behavior, achieving a regret bound of \(\tilde{O}(K^{1/3}T^{2/3})\). According to the abstract, when the agent behavior is reduced to the one studied by Dogan et al. (2023a), an algorithm based on the same robust framework achieves \(\tilde{O}(\sqrt{T})\) regret, improving on their \(\tilde{O}(T^{11/12})\) bound.
2. **Linear reward setting**:
- Propose a new algorithm (Algorithm 3) for linear reward settings, achieving an expected regret bound of \(\tilde{O}(d^{4/3}T^{2/3})\), where \(d\) is the dimension.
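To get a feel for how far apart the exponents in these bounds are, the following sketch compares only the polynomial part \(T^{\alpha}\) of each rate. It deliberately ignores constants, logarithmic factors, and the dependence on \(K\) and \(d\), so the numbers indicate growth rates rather than actual regret; the helper name `regret_rate` is hypothetical.

```python
def regret_rate(horizon, exponent):
    """Polynomial part T**exponent of a regret bound (no constants or logs)."""
    return horizon ** exponent


horizon = 10 ** 6
rates = {
    "sqrt(T)": regret_rate(horizon, 1 / 2),
    "T^(2/3)": regret_rate(horizon, 2 / 3),
    "T^(11/12)": regret_rate(horizon, 11 / 12),
}
for name, value in rates.items():
    print(f"{name}: {value:.0f}")
```

For a horizon of one million rounds, these rates are roughly 1,000, 10,000, and 316,000 respectively, so the \(T^{11/12}\) rate is much closer to linear regret than the other two.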
### Key challenges and solutions
1. **Challenge 1 (C1)**: Since the optimal incentive changes over time, it is impossible to search for the optimal incentive all at once. Therefore, it is necessary to find the appropriate time to search in order to balance accuracy and efficiency.
2. **Challenge 2 (C2)**: Due to the volatility of the empirical mean update, it is difficult to accurately search for the optimal incentive. To solve this problem, it is necessary to make the agent explore all arms more comprehensively through an appropriate incentive strategy, thereby stabilizing the estimator, but without causing too much regret.
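To make the search problem concrete, here is a minimal sketch of a plain binary search for the smallest incentive that makes an agent play a target arm. It is not the paper's search algorithm: it assumes the agent's estimates stay fixed during the search, that the agent never explores, and that rewards lie in \([0, 1]\) so an incentive of 1 always suffices. Challenges C1 and C2 arise precisely because the first assumption fails for a learning agent. The names `make_greedy_agent` and `search_min_incentive` are hypothetical.

```python
def make_greedy_agent(empirical_means):
    """Build a greedy agent whose estimates are frozen (an assumption)."""

    def respond(incentivized_arm, incentive):
        # The principal pays the incentive on one arm only; the agent
        # plays the arm maximizing estimate plus incentive.
        scores = list(empirical_means)
        scores[incentivized_arm] += incentive
        return max(range(len(scores)), key=lambda a: scores[a])

    return respond


def search_min_incentive(agent_response, target_arm, tol=1e-3):
    """Binary-search an incentive within tol of the smallest one that works.

    The principal never sees the agent's estimates; it only observes
    which arm the agent plays in response to each proposed incentive.
    """
    lo, hi = 0.0, 1.0  # assumes rewards in [0, 1], so 1.0 always suffices
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if agent_response(target_arm, mid) == target_arm:
            hi = mid  # incentive is enough; try a smaller one
        else:
            lo = mid  # incentive is too small
    return hi


agent = make_greedy_agent([0.8, 0.3])
incentive = search_min_incentive(agent, target_arm=1)
print(round(incentive, 3))
```

Each query costs one round of interaction, so a search with tolerance `tol` takes about \(\log_2(1/\text{tol})\) rounds. If the agent's estimates shift between queries, the threshold being searched for moves as well, and the interval maintained by this simple search is no longer guaranteed to contain it.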
### Elimination framework
To address the above challenges, the authors propose a new phased-elimination scheme. The framework has three main components:
- **Stabilize the estimators of bad arms**: In each stage, the algorithm will play all bad arms moderately to stabilize their estimators.
- **Search for approximately optimal incentives**: As the stage progresses, gradually reduce the search error.
- **Online elimination**: Identify and eliminate poorly performing arms and enter the next stage.
In this way, the algorithm can effectively reduce regret while ensuring accuracy.
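For readers unfamiliar with elimination schemes, the sketch below shows textbook phased elimination for a standard bandit with no agent in the loop. It is not the paper's algorithm: it omits the incentive search and the stabilization of bad-arm estimators, and the per-phase sample size and elimination threshold are illustrative assumptions. The function name `phased_elimination` and the callback `pull` are hypothetical.

```python
import math


def phased_elimination(pull, num_arms, num_phases):
    """Textbook phased elimination; returns the arms that survive.

    pull(arm) returns one reward sample for the given arm. In each phase
    every active arm is sampled equally often, and arms whose empirical
    mean falls too far below the best one are eliminated.
    """
    active = list(range(num_arms))
    for phase in range(1, num_phases + 1):
        eps = 2.0 ** (-phase)  # target accuracy halves every phase
        # Illustrative sample size: grows as 1 / eps**2 with a log factor.
        pulls_per_arm = math.ceil(
            2.0 * math.log(num_arms * num_phases + 1) / eps ** 2
        )
        means = {
            arm: sum(pull(arm) for _ in range(pulls_per_arm)) / pulls_per_arm
            for arm in active
        }
        best = max(means.values())
        # Keep only arms that are still plausibly within 2 * eps of the best.
        active = [arm for arm in active if means[arm] >= best - 2.0 * eps]
    return active


# Noise-free rewards make the example deterministic.
true_means = [0.9, 0.5, 0.1]
survivors = phased_elimination(lambda arm: true_means[arm], 3, 3)
print(survivors)
```

In the principal-agent setting, each "pull" would additionally require the principal to offer an incentive large enough for the agent to play the intended arm, which is where the incentive search and the stabilized estimators enter.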