J.E. Avron, O. Kenneth
Abstract: We study the swimming of non-relativistic deformable bodies in (empty) static curved spaces. We focus on the case where the ambient geometry allows for rigid body motions. In this case the swimming equations turn out to be geometric. For a small swimmer, the swimming distance in one stroke is determined by the Riemann curvature times certain moments of the swimmer.
What problem does this paper attempt to address?
The paper asks how to minimize the player's regret in the multi-armed bandit problem when the rewards are adaptive. In each round the player selects an "arm" (an option or decision) and receives the corresponding reward, but that reward may depend on the player's previous choices. This dependence makes the problem harder, because the rewards are no longer independent and identically distributed.
### Problem Background
The multi-armed bandit problem is a classic model of iterative decision-making under uncertainty. In each round the player selects an "arm" and receives the corresponding reward; only the reward of the selected arm is revealed, and the rewards of the unselected arms remain unknown. The goal is to minimize the player's total regret, defined as the gap between the total reward the player actually obtains and the total reward of the best fixed arm in hindsight.
### Core Problem of the Paper
Most earlier work assumes that the rewards are non-adaptive, that is, the reward distribution of each arm is unrelated to the player's previous decisions. In the more general adaptive reward model, the reward of each arm can depend on the player's entire decision history. Algorithms designed for the non-adaptive case may lose their guarantees in this setting, so new algorithms are needed.
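To make the notion of an adaptive reward concrete, here is a minimal sketch, not taken from the paper. It is written in terms of costs rather than rewards, and the rule used by the hypothetical `adaptive_costs` function (charge a cost of 1 to whichever arm the player chose in the previous round) is an assumption chosen purely for illustration.

```python
def adaptive_costs(history, num_arms):
    """Hypothetical adaptive cost rule: the arm played in the previous
    round costs 1.0 this round, every other arm costs 0.0."""
    costs = [0.0] * num_arms
    if history:
        costs[history[-1]] = 1.0
    return costs


def play(choices, num_arms):
    """Replay a fixed sequence of arm choices against the adaptive rule
    and return the cost incurred in each round."""
    history = []
    incurred = []
    for arm in choices:
        costs = adaptive_costs(history, num_arms)  # depends on past choices
        incurred.append(costs[arm])                # only this entry is observed
        history.append(arm)
    return incurred


if __name__ == "__main__":
    print(play([0, 0, 1, 1, 0], num_arms=2))
```

Because the cost vector is a function of `history`, the same arm can be cheap or expensive depending on what the player did earlier, which is exactly what breaks the i.i.d. assumption.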
### Main Contributions of the Paper
The paper proposes a new algorithm and proves that it achieves a near-optimal regret upper bound for both adaptive and non-adaptive rewards. Specifically:
1. **New Algorithm**: The paper proposes an algorithm named "Accounts", which copes with adaptive rewards by introducing an "account" mechanism that controls the exploration probability.
2. **Theoretical Guarantee**: For any adaptively selected reward sequence, the regret of the "Accounts" algorithm is \( O(\sqrt{T K \log K}) \) with high probability, where \( T \) is the number of rounds and \( K \) is the number of arms. This matches the best known upper bound.
3. **Improved Upper Bound**: The previous best result was \( O(T^{2/3}(K \log K)^{1/3}) \); the new bound improves the dependence on \( T \) from \( T^{2/3} \) to \( \sqrt{T} \).
4. **Lower Bound Analysis**: The paper also proves that, under adaptive rewards, the expected regret of the Exp3 algorithm is \( \Omega(T^{2/3}) \), so Exp3 cannot do significantly better in this setting.
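The gap between the two rates in the list above can be illustrated numerically. The sketch below evaluates both expressions with the constants hidden by the \( O(\cdot) \) notation set to 1, which is an assumption made only for comparison; the function names are hypothetical.

```python
import math


def sqrt_rate(T, K):
    """sqrt(T * K * ln K), the rate stated for the "Accounts" algorithm."""
    return math.sqrt(T * K * math.log(K))


def two_thirds_rate(T, K):
    """T^(2/3) * (K * ln K)^(1/3), the rate of the previous best result."""
    return T ** (2 / 3) * (K * math.log(K)) ** (1 / 3)


def rate_ratio(T, K):
    """Ratio of the older rate to the newer one.
    Algebraically this equals (T / (K * ln K)) ** (1/6)."""
    return two_thirds_rate(T, K) / sqrt_rate(T, K)


if __name__ == "__main__":
    for T in (10**3, 10**6, 10**9):
        print(T, round(sqrt_rate(T, 10)), round(two_thirds_rate(T, 10)))
```

The ratio grows like \( (T / (K \ln K))^{1/6} \), so the square-root rate is the smaller of the two whenever \( T > K \ln K \), and the advantage widens slowly as \( T \) increases.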
### Formula Summary
- Definition of regret \( R \):
\[
R=\sum_{i = 1}^T c_{M_i}-\min_j\sum_{i = 1}^T c_{ij}
\]
- Upper bound of the regret of the "Accounts" algorithm:
\[
\Pr\left(R\geq(\alpha + 7)\sqrt{T K \ln K}\right)\leq1000K\sqrt{\alpha}\exp\left(-\frac{\sqrt{\alpha}\ln K}{8}\right)
\]
- Expected regret:
\[
E[R]=O(\sqrt{T K \ln K})
\]
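The formulas above can be evaluated directly. The sketch below assumes that \( c_{ij} \) is the cost of arm \( j \) in round \( i \) and that \( M_i \) is the arm the player selected in round \( i \); this reading of the symbols is inferred from the regret formula and is not stated explicitly in the summary. The function names `regret` and `tail_bound` are hypothetical.

```python
import math


def regret(costs, choices):
    """R = sum_i c[i][M_i] - min_j sum_i c[i][j].

    costs:   list of T rows, each a list of K per-arm costs (assumed layout)
    choices: list of T arm indices, choices[i] playing the role of M_i
    """
    num_rounds = len(costs)
    num_arms = len(costs[0])
    incurred = sum(costs[i][choices[i]] for i in range(num_rounds))
    best_fixed = min(
        sum(costs[i][j] for i in range(num_rounds)) for j in range(num_arms)
    )
    return incurred - best_fixed


def tail_bound(alpha, K):
    """Right-hand side of the high-probability bound:
    1000 * K * sqrt(alpha) * exp(-sqrt(alpha) * ln(K) / 8)."""
    root = math.sqrt(alpha)
    return 1000 * K * root * math.exp(-root * math.log(K) / 8)


if __name__ == "__main__":
    costs = [[1, 0], [0, 1], [1, 0]]
    print(regret(costs, [0, 0, 0]))
    print(tail_bound(100, 2), tail_bound(10_000, 10))
```

Because of the leading factor \( 1000K\sqrt{\alpha} \), the right-hand side exceeds 1 for small \( \alpha \) and is then vacuous as a probability bound; it only becomes informative once \( \sqrt{\alpha}\ln K \) is large enough for the exponential to dominate.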
Taken together, these results address the multi-armed bandit problem under adaptive rewards: a \( \sqrt{T} \)-type regret bound that holds with high probability, and a lower bound showing that Exp3 cannot match it in the adaptive setting.