Abstract: We study learning in a dynamically evolving environment modeled as a Markov game between a learner and a strategic opponent that can adapt to the learner's strategies. While most existing works in Markov games focus on external regret as the learning objective, external regret becomes inadequate when the adversaries are adaptive. In this work, we focus on \emph{policy regret} -- a counterfactual notion that aims to compete with the return that would have been attained if the learner had followed the best fixed sequence of policies, in hindsight. We show that if the opponent has unbounded memory or if it is non-stationary, then sample-efficient learning is not possible. For memory-bounded and stationary adversaries, we show that learning is still statistically hard if the set of feasible strategies for the learner is exponentially large. To guarantee learnability, we introduce a new notion of \emph{consistent} adaptive adversaries, wherein the adversary responds similarly to similar strategies of the learner. We provide algorithms that achieve $\sqrt{T}$ policy regret against memory-bounded, stationary, and consistent adversaries.

### What problem does this paper attempt to solve?

This paper studies learning in Markov games (MGs) against adaptive opponents. Specifically, the authors consider a learner interacting, in a dynamically evolving environment, with a strategic opponent that can adjust to the learner's strategy. Most prior work on Markov games uses external regret as the learning objective, but external regret is inadequate when the opponent is adaptive, because it ignores how the opponent would have reacted had the learner played differently. The paper therefore uses **policy regret** to measure the learner's performance.

#### Policy Regret

Policy regret is a counterfactual notion: it compares the learner's return with the return it would have obtained by following the best fixed policy in every round, taking into account how the opponent would have responded to that policy. That is:

\[ PR(T)=\sup_{\pi\in\Pi}\sum_{t = 1}^T \left( V_{\pi,f_t([\pi]_t)}(s_1)-V_{\pi_t,f_t(\pi_1,\ldots,\pi_t)}(s_1) \right) \]

where:

- $PR(T)$ is the learner's policy regret after $T$ iterations.
- $V_{\pi,f_t([\pi]_t)}(s_1)$ is the expected cumulative reward starting from the initial state $s_1$ when the learner and the opponent use strategies $\pi$ and $f_t([\pi]_t)$ respectively.
- $f_t$ is the opponent's response function at iteration $t$, which maps the learner's sequence of policies to the opponent's strategy.
- $[\pi]_t$ denotes the policy $\pi$ repeated $t$ times.

#### Main contributions

1. **Theoretical analysis**:
   - If the opponent has unbounded memory or is non-stationary, sample-efficient learning is impossible.
   - Even against a memory-bounded, stationary opponent, learning remains statistically hard when the learner's strategy set is exponentially large.
2. **Algorithm design**:
   - The authors propose two algorithms, OPO-OMLE and APE-OVE, for memory-bounded, stationary, and consistent opponents.
   - For $m = 1$, OPO-OMLE achieves a policy regret of $\tilde{O}(H^3S^2AB+\sqrt{H^5SA^2BT})$.
   - For general $m\geq1$, APE-OVE achieves a policy regret of $\tilde{O}\left((m - 1)H^2SAB+\sqrt{\frac{H^3SAB(SAB(H+\sqrt{S})+H^2)}{d^*}T}\right)$.

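
To make the gap between external regret and policy regret concrete, the following is a minimal sketch on a toy problem, not the paper's algorithms. It assumes a one-step game ($H = 1$) with two deterministic learner policies, a hypothetical value table `V`, and a hypothetical stationary opponent `f` that only sees the learner's last `m` policies; all of these names and numbers are illustrative assumptions.

```python
# Hypothetical one-step game: V[a][b] is the learner's value when the learner
# plays policy a and the opponent responds with strategy b (assumed numbers).
V = [[0.3, 1.0],
     [0.0, 0.5]]


def f(window):
    """Hypothetical stationary opponent: plays 1 only if every learner
    policy inside its memory window equals 1, otherwise plays 0."""
    return 1 if all(p == 1 for p in window) else 0


def total_return(seq, m):
    """Learner's return when it plays seq and the opponent reacts to the
    last m learner policies (fewer in the first rounds)."""
    return sum(V[seq[t]][f(seq[max(0, t - m + 1): t + 1])]
               for t in range(len(seq)))


def policy_regret(played, m):
    """Best fixed policy, with the opponent re-responding to the repeated
    policy [pi]_t, minus the return actually obtained."""
    T = len(played)
    best = max(total_return([pi] * T, m) for pi in range(len(V)))
    return best - total_return(played, m)


def external_regret(played, m):
    """Best fixed policy against the opponent responses that were actually
    realized, minus the return actually obtained."""
    T = len(played)
    responses = [f(played[max(0, t - m + 1): t + 1]) for t in range(T)]
    realized = sum(V[a][b] for a, b in zip(played, responses))
    best = max(sum(V[pi][b] for b in responses) for pi in range(len(V)))
    return best - realized


played = [1] * 10  # the learner always plays policy 1
print(policy_regret(played, m=1), external_regret(played, m=1))
```

In this toy instance, always playing policy 1 has zero policy regret, yet external regret is large: it credits policy 0 with the value it would earn against responses the opponent would never have chosen had the learner actually played policy 0.
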
Through these results, the authors fill a theoretical gap in multi-agent reinforcement learning (MARL) concerning policy regret against adaptive opponents, and they provide new ideas for designing effective learning algorithms in this setting.

### Summary

The paper evaluates a learner's performance in Markov games against adaptive opponents through the notion of policy regret. It establishes when sample-efficient learning is impossible and proposes algorithms that achieve low policy regret when the opponent is memory-bounded, stationary, and consistent.

Learning in Markov Games with Adaptive Adversaries: Policy Regret, Fundamental Barriers, and Efficient Algorithms

Policy Regret in Repeated Games

Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret

Online Markov Decision Processes with Non-Oblivious Strategic Adversary

Dynamic Regret of Online Markov Decision Processes

Optimistic Regret Bounds for Online Learning in Adversarial Markov Decision Processes

Is Learning in Games Good for the Learners?

Learning in Multi-Player Stochastic Games

$\sqrt{n}$-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank

Online Learning: Stochastic and Constrained Adversaries

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Online Convex Optimization in Adversarial Markov Decision Processes

Dynamic Regret of Adversarial MDPs with Unknown Transition and Linear Function Approximation

Responding to Promises: No-regret learning against followers with memory

Do LLM Agents Have Regret? A Case Study in Online Learning and Games

Near Optimal Memory-Regret Tradeoff for Online Learning

Learning not to Regret

Evolutionary Dynamics and $Φ$-Regret Minimization in Games

Learning Adversarial MDPs with Stochastic Hard Constraints

Learning Constrained Markov Decision Processes With Non-stationary Rewards and Constraints