Abstract: The behavior of no-regret learning algorithms is well understood in two-player min-max (i.e., zero-sum) games. In this paper, we investigate the behavior of no-regret learning in min-max games with dependent strategy sets, where the strategy of the first player constrains the behavior of the second. Such games are best understood as sequential, i.e., min-max Stackelberg, games. We consider two settings, one in which only the first player chooses their actions using a no-regret algorithm while the second player best responds, and one in which both players use no-regret algorithms. For the former case, we show that no-regret dynamics converge to a Stackelberg equilibrium. For the latter case, we introduce a new type of regret, which we call Lagrangian regret, and show that if both players minimize their Lagrangian regrets, then play converges to a Stackelberg equilibrium. We then observe that online mirror descent (OMD) dynamics in these two settings correspond respectively to a known nested (i.e., sequential) gradient descent-ascent (GDA) algorithm and a new simultaneous GDA-like algorithm, thereby establishing convergence of these algorithms to Stackelberg equilibrium. Finally, we analyze the robustness of OMD dynamics to perturbations by investigating online min-max Stackelberg games. We prove that OMD dynamics are robust for a large class of online min-max games with independent strategy sets. In the dependent case, we demonstrate the robustness of OMD dynamics experimentally by simulating them in online Fisher markets, a canonical example of a min-max Stackelberg game with dependent strategy sets.

Online Markov Decision Processes with Non-Oblivious Strategic Adversary

Dynamic Regret of Online Markov Decision Processes

Optimistic Regret Bounds for Online Learning in Adversarial Markov Decision Processes

Online Double Oracle

Online Convex Optimization in Adversarial Markov Decision Processes

Online Markov decision processes with policy iteration

Learning in Markov Games with Adaptive Adversaries: Policy Regret, Fundamental Barriers, and Efficient Algorithms

Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization

Learning Adversarial Low-rank Markov Decision Processes with Unknown Transition and Full-information Feedback

Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret with Adversarial Robustness in Partial Monitoring

√N-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank

Robust No-Regret Learning in Min-Max Stackelberg Games

Learning Adversarial MDPs with Stochastic Hard Constraints

Online Policy Optimization for Robust MDP

Online Reinforcement Learning in Markov Decision Process Using Linear Programming

Truly No-Regret Learning in Constrained MDPs

Faster Optimistic Online Mirror Descent for Extensive-Form Games

Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and Exp-Concave Games with Gradient Feedback

Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback

Regret-Minimizing Double Oracle for Extensive-Form Games