Learning in Markov Games with Adaptive Adversaries: Policy Regret, Fundamental Barriers, and Efficient Algorithms

Thanh Nguyen-Tang, Raman Arora
2024-11-02
Abstract: We study learning in a dynamically evolving environment modeled as a Markov game between a learner and a strategic opponent that can adapt to the learner's strategies. While most existing works in Markov games focus on external regret as the learning objective, external regret becomes inadequate when the adversaries are adaptive. In this work, we focus on \emph{policy regret} -- a counterfactual notion that aims to compete with the return that would have been attained if the learner had followed the best fixed sequence of policies, in hindsight. We show that if the opponent has unbounded memory or if it is non-stationary, then sample-efficient learning is not possible. For memory-bounded and stationary adversaries, we show that learning is still statistically hard if the set of feasible strategies for the learner is exponentially large. To guarantee learnability, we introduce a new notion of \emph{consistent} adaptive adversaries, wherein the adversary responds similarly to similar strategies of the learner. We provide algorithms that achieve $\sqrt{T}$ policy regret against memory-bounded, stationary, and consistent adversaries.
Machine Learning, Artificial Intelligence, Computer Science and Game Theory
### What problem does this paper attempt to solve?

This paper studies learning in Markov games (MGs) against adaptive opponents. Specifically, the authors consider the interaction between a learner and a strategic opponent that can adjust to the learner's strategies in a dynamically evolving environment. Most prior work on Markov games uses external regret as the learning objective, but external regret is inadequate when the opponent is adaptive, because it ignores how the opponent would have responded had the learner behaved differently. The paper therefore adopts **policy regret** as the measure of the learner's performance.

#### Policy Regret

Policy regret is a counterfactual notion. It compares the learner's realized return with the return the learner would have obtained by following the best fixed policy in every round, with the opponent responding to that fixed sequence instead of to the policies actually played. That is:

\[ PR(T)=\sup_{\pi\in\Pi}\sum_{t = 1}^T \left( V_{\pi,f_t([\pi]_t)}(s_1)-V_{\pi_t,f_t(\pi_1,\ldots,\pi_t)}(s_1) \right) \]

where:

- \(PR(T)\) is the learner's policy regret after \(T\) iterations.
- \(\pi_t\) is the learner's policy in iteration \(t\), and \(f_t\) maps the learner's first \(t\) policies to the opponent's policy in iteration \(t\).
- \(V_{\pi,f_t([\pi]_t)}(s_1)\) is the expected cumulative reward starting from the initial state \(s_1\) when the learner and the opponent use strategies \(\pi\) and \(f_t([\pi]_t)\) respectively.
- \([\pi]_t\) denotes the strategy \(\pi\) repeated \(t\) times.

#### Main contributions

1. **Theoretical analysis**:
   - When the opponent has unbounded memory or is non-stationary, sample-efficient learning is impossible.
   - Even when the opponent is memory-bounded and stationary, learning remains statistically hard if the learner's strategy set is exponentially large.
2. **Algorithm design**:
   - The paper proposes two algorithms, OPO-OMLE and APE-OVE, for opponents that are memory-bounded, stationary, and consistent.
   - For the case of \(m = 1\), OPO-OMLE achieves a policy regret of \(\tilde{O}(H^3S^2AB+\sqrt{H^5SA^2BT})\).
   - For general \(m\geq1\), APE-OVE achieves a policy regret of \(\tilde{O}\left((m - 1)H^2SAB+\sqrt{\frac{H^3SAB(SAB(H+\sqrt{S})+H^2)}{d^*}T}\right)\).

Through these results, the authors fill a theoretical gap in multi-agent reinforcement learning (MARL) concerning policy regret against adaptive opponents, and they offer new ideas for designing learning algorithms that remain effective against such opponents.

### Summary

The paper evaluates a learner in Markov games against adaptive opponents through the lens of policy regret. It establishes when sample-efficient learning is impossible and provides algorithms that achieve \(\sqrt{T}\) policy regret against memory-bounded, stationary, and consistent opponents.
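
To make the gap between external regret and policy regret concrete, here is a minimal sketch on a toy problem. It is not taken from the paper: the game is a single-state, horizon-1 Markov game (a matrix game), the reward matrix is the classic prisoner's dilemma, and `make_adversary`, `realized_play`, `external_regret`, and `policy_regret` are hypothetical helper names introduced only for illustration. The adversary is assumed to be stationary with memory \(m\): it cooperates only if the learner's last \(m\) policies all cooperate.

```python
from typing import Callable, List, Sequence, Tuple

# Toy one-state, horizon-1 Markov game; all numbers are illustrative assumptions.
# REWARD[a][b] is the learner's reward when the learner plays a and the opponent plays b.
COOPERATE, DEFECT = 0, 1
REWARD = [[3, 0],
          [5, 1]]
POLICIES = (COOPERATE, DEFECT)  # the learner's (deterministic) policy class


def make_adversary(m: int) -> Callable[[Sequence[int]], int]:
    """Hypothetical stationary adversary with memory m >= 1: it cooperates only
    if the learner's last m policies (including the current one) all cooperate."""
    if m < 1:
        raise ValueError("memory m must be at least 1")

    def respond(history: Sequence[int]) -> int:
        window = history[-m:]
        return COOPERATE if all(p == COOPERATE for p in window) else DEFECT

    return respond


def realized_play(learner: Sequence[int],
                  adversary: Callable[[Sequence[int]], int]) -> Tuple[List[int], int]:
    """Return the opponent's responses and the learner's total reward."""
    responses = [adversary(learner[: t + 1]) for t in range(len(learner))]
    value = sum(REWARD[a][b] for a, b in zip(learner, responses))
    return responses, value


def external_regret(learner, adversary) -> int:
    """Best fixed policy against the opponent responses that actually occurred."""
    responses, value = realized_play(learner, adversary)
    best_fixed = max(sum(REWARD[pi][b] for b in responses) for pi in POLICIES)
    return best_fixed - value


def policy_regret(learner, adversary) -> int:
    """Best fixed policy when the opponent re-responds to the repeated policy."""
    _, value = realized_play(learner, adversary)
    T = len(learner)
    best_counterfactual = max(realized_play([pi] * T, adversary)[1] for pi in POLICIES)
    return best_counterfactual - value


T = 10
always_defect = [DEFECT] * T
adversary = make_adversary(m=1)
print(external_regret(always_defect, adversary))  # 0
print(policy_regret(always_defect, adversary))    # 20
```

In this sketch a learner that always defects has zero external regret, because defecting is the best response to the opponent behavior that was actually observed. Its policy regret grows linearly in \(T\), because a learner that had always cooperated would have induced a cooperating opponent. This is the sense in which external regret can be uninformative against an adaptive opponent.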