Scale-free Adversarial Reinforcement Learning

Mingyu Chen, Xuezhou Zhang
2024-03-02
Abstract: This paper initiates the study of scale-free learning in Markov Decision Processes (MDPs), where the scale of rewards/losses is unknown to the learner. We design a generic algorithmic framework, \underline{S}cale \underline{C}lipping \underline{B}ound (\texttt{SCB}), and instantiate this framework in both the adversarial Multi-armed Bandit (MAB) setting and the adversarial MDP setting. Through this framework, we achieve the first minimax optimal expected regret bound and the first high-probability regret bound in scale-free adversarial MABs, resolving an open problem raised in \cite{hadiji2023adaptation}. On adversarial MDPs, our framework also gives birth to the first scale-free RL algorithm with a $\tilde{\mathcal{O}}(\sqrt{T})$ high-probability regret guarantee.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper studies scale-free learning in Markov Decision Processes (MDPs), especially in adversarial environments where the scale of rewards or losses is unknown to the learner. Specifically, the paper focuses on the following aspects:

1. **Scale-free learning**: Traditional reinforcement learning algorithms usually assume that rewards or losses are bounded (for example, restricted to the range [0, 1]). However, in many real-world applications, no such natural bound on the losses exists, or it is not known to the algorithm in advance. For example, in quantitative trading, stock prices vary significantly over time and across stocks, and their fluctuation ranges are often unknown. Most existing algorithms are therefore not applicable in this case.
2. **Multi-armed bandit (MAB) problems in adversarial environments**: Although previous research has explored scale-free learning in online learning, work on decision-making under uncertainty (such as MABs and MDPs) is very limited. In particular, for adversarial MABs, existing scale-free algorithms do not achieve minimax optimality, and they bound only the expected regret; their guarantees do not extend to high-probability regret bounds.
3. **Scale-free learning in adversarial MDPs**: The paper proposes a new framework, Scale Clipping Bound (SCB), and uses it to build the first scale-free RL algorithm in the adversarial MDP setting with a $\tilde{O}(\sqrt{T})$ high-probability regret guarantee. This is the first work to achieve scale-free learning in the setting of unknown transition functions, unbounded losses, and bandit feedback.

### Main contributions

1. **Minimax optimal expected regret bound**: The authors propose a scale-free adversarial MAB algorithm, SCB, which achieves the minimax optimal expected regret bound $\Theta(\ell_\infty\sqrt{nT})$ without knowing the loss magnitude in advance, eliminating the $\log(n)$ and $\log(T)$ factors present in previous work.
2. **High-probability regret bound**: Based on the SCB framework, the authors construct the SCB-IX algorithm, which is the first scale-free adversarial MAB algorithm to achieve a high-probability regret bound.
3. **Scale-free learning in adversarial MDPs**: The authors extend the above ideas to the adversarial MDP setting and propose the SCB-RL algorithm, which is the first scale-free algorithm to achieve a $\tilde{O}(\sqrt{T})$ high-probability regret bound in the setting of unknown transition functions, unbounded losses, and bandit feedback.

Through these contributions, the paper addresses the key challenges of scale-free learning in adversarial MABs and MDPs, fills gaps in existing research, and suggests new directions for future work.
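
For reference, the bound $\Theta(\ell_\infty\sqrt{nT})$ quoted in the contributions can be read against the usual adversarial bandit definitions. The notation below is an assumption based on the standard convention, not copied from the paper: $n$ is the number of arms, $T$ the number of rounds, $\ell_{t,a}$ the loss of arm $a$ in round $t$, and $A_t$ the arm the learner plays.

$$
\ell_\infty = \max_{1 \le t \le T}\,\max_{1 \le a \le n} |\ell_{t,a}|,
\qquad
R_T = \sum_{t=1}^{T} \ell_{t,A_t} \;-\; \min_{1 \le a \le n} \sum_{t=1}^{T} \ell_{t,a}
$$

Under this reading, a scale-free guarantee of the stated order means $\mathbb{E}[R_T] \le C\,\ell_\infty\sqrt{nT}$ for an absolute constant $C$, where $\ell_\infty$ appears only in the bound and is never given to the algorithm as an input.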