Abstract: This paper initiates the study of scale-free learning in Markov Decision Processes (MDPs), where the scale of rewards/losses is unknown to the learner. We design a generic algorithmic framework, \underline{S}cale \underline{C}lipping \underline{B}ound (\texttt{SCB}), and instantiate this framework in both the adversarial Multi-armed Bandit (MAB) setting and the adversarial MDP setting. Through this framework, we achieve the first minimax optimal expected regret bound and the first high-probability regret bound in scale-free adversarial MABs, resolving an open problem raised in \cite{hadiji2023adaptation}. On adversarial MDPs, our framework also gives rise to the first scale-free RL algorithm with a $\tilde{\mathcal{O}}(\sqrt{T})$ high-probability regret guarantee.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper studies scale-free learning in Markov Decision Processes (MDPs), especially in adversarial environments where the scale of rewards or losses is unknown to the learner. Specifically, the paper focuses on the following aspects:

1. **Scale-free learning**: Traditional reinforcement learning algorithms usually assume that rewards or losses are bounded (for example, restricted to the range [0, 1]). However, in many real-world applications, such natural loss bounds do not exist or are not known to the algorithm in advance. For example, in quantitative trading, stock prices vary significantly over time and across stocks, and their fluctuation ranges are often unknown. Most existing algorithms therefore do not apply in this case.
2. **Multi-armed bandit (MAB) problems in adversarial environments**: Although previous research has explored scale-free learning in online learning, research on decision-making under uncertainty (such as MABs and MDPs) is very limited. In particular, for adversarial MABs, existing scale-free algorithms do not achieve minimax optimality, and they bound only the expected regret; their guarantees do not extend to high-probability regret bounds.
3. **Scale-free learning in adversarial MDPs**: The paper proposes a new framework, Scale Clipping Bound (SCB), and uses it to build the first scale-free RL algorithm in the adversarial MDP setting with a $\tilde{O}(\sqrt{T})$ high-probability regret guarantee. This is the first work to achieve scale-free learning in the setting of unknown transition functions, unbounded losses, and bandit feedback.

### Main contributions

1. **Minimax optimal expected regret bound**: The authors propose a scale-free adversarial MAB algorithm, SCB, which achieves the minimax optimal expected regret bound $\Theta(\ell_\infty\sqrt{nT})$ without knowing the loss magnitude in advance, eliminating the $\log(n)$ and $\log(T)$ factors present in previous work.
2. **High-probability regret bound**: Based on the SCB framework, the authors construct the SCB-IX algorithm, which is the first scale-free adversarial MAB algorithm to achieve a high-probability regret bound.
3. **Scale-free learning in adversarial MDPs**: The authors extend these ideas to the adversarial MDP setting and propose the SCB-RL algorithm, which is the first scale-free algorithm to achieve a $\tilde{O}(\sqrt{T})$ high-probability regret bound in the setting of unknown transition functions, unbounded losses, and bandit feedback.

Through these contributions, the paper addresses the key challenges of scale-free learning in adversarial MABs and MDPs, fills gaps in existing research, and opens new directions for future work.
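The summary above does not spell out the SCB algorithm, so the following is only an illustrative sketch of what "scale-free" means in an adversarial bandit, not the paper's method. It assumes a standard EXP3-style learner, clips each observed loss to the largest magnitude seen in earlier rounds, and divides the learning rate by that running scale. The function names, the `loss_fn(t, arm)` callback, and the learning-rate tuning are all hypothetical choices made for this example.

```python
import math
import random


def clip_loss(loss, scale):
    """Clip a loss to the interval [-scale, scale]."""
    return max(-scale, min(scale, loss))


def exp3_clipped(loss_fn, n_arms, horizon, seed=0):
    """EXP3-style learner that is never told the loss scale in advance.

    loss_fn(t, arm) is a hypothetical callback returning the loss of `arm`
    at round t; only the pulled arm's loss is observed (bandit feedback).
    """
    rng = random.Random(seed)
    cum_est = [0.0] * n_arms  # importance-weighted cumulative loss estimates
    pulls = [0] * n_arms
    scale = 0.0               # largest |loss| observed so far

    for t in range(1, horizon + 1):
        # Assumed tuning: the usual EXP3 rate divided by the running scale.
        eta = math.sqrt(math.log(n_arms) / (n_arms * t)) / max(scale, 1e-12)
        base = min(cum_est)   # shift for numerical stability
        weights = [math.exp(-eta * (c - base)) for c in cum_est]
        total = sum(weights)
        probs = [w / total for w in weights]

        arm = rng.choices(range(n_arms), weights=probs)[0]
        pulls[arm] += 1
        loss = loss_fn(t, arm)

        # Clip with the scale known *before* this round, then update the scale.
        clipped = clip_loss(loss, scale)
        cum_est[arm] += clipped / probs[arm]
        scale = max(scale, abs(loss))

    return pulls, scale


if __name__ == "__main__":
    # Arm 0 always has loss 0; the others have loss 1000 (unknown to the learner).
    pulls, scale = exp3_clipped(lambda t, a: 0.0 if a == 0 else 1000.0, 3, 2000)
    print(pulls, scale)
```

Because the clipping threshold and the learning rate both come from observed data, the same code runs unchanged whether losses are of order 1 or of order 1000. This simple heuristic carries none of the regret guarantees stated above.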

Scale-free Adversarial Reinforcement Learning

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback

Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes

Learning Adversarial MDPs with Stochastic Hard Constraints

Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback

Fundamental Limits of Reinforcement Learning in Environment with Endogeneous and Exogeneous Uncertainty

Online Reinforcement Learning in Markov Decision Process Using Linear Programming

Truly No-Regret Learning in Constrained MDPs

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Learning Constrained Markov Decision Processes With Non-stationary Rewards and Constraints

Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Dynamic Regret of Adversarial MDPs with Unknown Transition and Linear Function Approximation

Dynamic Regret of Online Markov Decision Processes

Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback

An Information-Theoretic Analysis of Bayesian Reinforcement Learning

A Bayesian Approach to Learning Bandit Structure in Markov Decision Processes

Learning Adversarial Low-rank Markov Decision Processes with Unknown Transition and Full-information Feedback

Almost Optimal Model-Free Reinforcement Learning Via Reference-Advantage Decomposition

Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes

$\sqrt{n}$-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank