Abstract: We consider the problem of learning to play a repeated contextual game with unknown reward and unknown constraint functions. Such games arise in applications where each agent's action needs to belong to a feasible set, but the feasible set is a priori unknown. For example, in constrained multi-agent reinforcement learning, the constraints on the agents' policies are a function of the unknown dynamics and hence are themselves unknown. Under kernel-based regularity assumptions on the unknown functions, we develop a no-regret, no-violation approach which exploits similarities among different reward and constraint outcomes. The no-violation property ensures that the time-averaged sum of constraint violations converges to zero as the game is repeated. We show that our algorithm, referred to as c.z.AdaNormalGP, obtains kernel-dependent regret bounds and that the cumulative constraint violations have sublinear kernel-dependent upper bounds. In addition, we introduce the notion of constrained contextual coarse correlated equilibria (c.z.CCE) and show that $\epsilon$-c.z.CCEs can be approached whenever players follow a no-regret no-violation strategy. Finally, we experimentally demonstrate the effectiveness of c.z.AdaNormalGP on an instance of multi-agent reinforcement learning.

What problem does this paper attempt to address?

The paper addresses how multiple agents can learn to act in a repeated contextual game with unknown constraints while achieving both no-regret and no-violation. Each agent must select a feasible action based on the observed context, but the constraints that define feasibility are not known in advance. Because both the constraints and the reward function are unknown, the core difficulty is to learn a strategy that incurs no regret in the long run and does not persistently violate the constraints, without ever being told what those constraints are.

### Specific description of the problem

1. **Contextual game**: In each round, each agent observes a context $z_t$ and then selects an action $a_i^t$ based on it. The selected action must satisfy the unknown constraints.
2. **Unknown constraints**: The feasible action set $A_i(z)$ of each agent is defined by a set of unknown constraint functions, $g_{i,m}(a_i, z) \leq 0$, which depend on the context $z$ and the agent's action $a_i$.
3. **Reward function**: The reward function $r_i(a, z)$ of each agent is also unknown and depends on the actions of all agents and the current context.

### Objectives

- **No-regret**: The agent's cumulative reward should be close to the maximum cumulative reward obtainable by the optimal fixed strategy.
- **No-violation**: The agent's cumulative constraint violation should be as small as possible, that is, the time-averaged constraint violation converges to zero.

### Solution

To achieve these objectives, the paper proposes a new algorithm, c.z.AdaNormalGP, which combines three ingredients:

1. **Gaussian Process (GP) framework**: Gaussian processes are used to estimate the unknown reward and constraint functions and to construct confidence intervals around them.
2. **Optimistic estimation**: The reward function is estimated with an upper confidence bound (UCB), and the constraint functions with a lower confidence bound (LCB). This ensures that feasible actions can still be found even when the constraint estimates are poor in the early rounds.
3. **Sleeping experts problem**: The contextual game with unknown constraints is cast as a sleeping experts problem, so that the existing sleeping experts algorithm AdaNormalHedge can be used to update the probability distribution over actions.

### Main contributions

- A new algorithm, c.z.AdaNormalGP, that achieves no-regret and no-violation in contextual games with unknown constraints.
- A new equilibrium concept, the constrained contextual coarse correlated equilibrium (c.z.CCE), together with a proof that this equilibrium can be approached when all agents follow no-regret, no-violation strategies.
- Theoretical guarantees for the algorithm, including upper bounds on the regret and on the cumulative constraint violations.

In this way, the paper shows how multiple agents can learn strategies that neither incur regret nor persistently violate the constraints in contextual games where those constraints are unknown.
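The optimistic-estimation and sleeping-experts steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the GP posterior means and standard deviations are assumed to be given as plain lists, the confidence width `beta` and learning rate `eta` are hypothetical parameters, all function names are invented for illustration, and a plain Hedge update restricted to the awake actions stands in for AdaNormalHedge.

```python
import math

def optimistic_feasible(mu_g, sigma_g, beta):
    """Actions whose constraint LCB (mean - beta * std) is <= 0.

    These are the 'awake' experts: actions that cannot yet be ruled
    out as infeasible under the current confidence bounds.
    """
    return [a for a, (m, s) in enumerate(zip(mu_g, sigma_g))
            if m - beta * s <= 0.0]

def reward_ucb(mu_r, sigma_r, beta):
    """Optimistic reward estimate (mean + beta * std) for each action."""
    return [m + beta * s for m, s in zip(mu_r, sigma_r)]

def sleeping_hedge_step(weights, awake, gains, eta):
    """One Hedge update restricted to awake actions (simplified stand-in).

    Returns the play distribution over all actions and the new weights.
    Asleep actions get probability 0 and keep their weight unchanged.
    """
    awake_set = set(awake)
    total = sum(weights[a] for a in awake)
    probs = [weights[a] / total if a in awake_set else 0.0
             for a in range(len(weights))]
    # Gain of the played mixture; awake actions are scored relative to it.
    mix_gain = sum(probs[a] * gains[a] for a in awake)
    new_weights = list(weights)
    for a in awake:
        new_weights[a] = weights[a] * math.exp(eta * (gains[a] - mix_gain))
    return probs, new_weights

# Hypothetical posterior summaries for three actions in one context.
mu_r, sigma_r = [0.2, 0.8, 0.5], [0.1, 0.1, 0.3]
mu_g, sigma_g = [-0.5, 0.4, 0.1], [0.1, 0.1, 0.2]

awake = optimistic_feasible(mu_g, sigma_g, beta=1.0)   # awake == [0, 2]
gains = reward_ucb(mu_r, sigma_r, beta=1.0)
probs, weights = sleeping_hedge_step([1.0, 1.0, 1.0], awake, gains, eta=0.5)
```

In this toy round, action 1 has the highest optimistic reward but is put to sleep because even its constraint lower bound is positive, while action 2 stays awake only because its constraint estimate is still uncertain. As the confidence intervals shrink over rounds, fewer infeasible actions remain awake, which is the intuition behind the vanishing time-averaged violation.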

Multi-Agent Learning in Contextual Games under Unknown Constraints

Contextual Games: Multi-Agent Learning with Side Information

Learning in Multi-Player Stochastic Games

Anytime-Constrained Multi-Agent Reinforcement Learning

Truly No-Regret Learning in Constrained MDPs

Multiagent Soft Q-Learning

High Probability Bound for Cross-Learning Contextual Bandits with Unknown Context Distributions

Optimal cross-learning for contextual bandits with unknown context distributions

Provably Efficient Generalized Lagrangian Policy Optimization for Safe Multi-Agent Reinforcement Learning

Generalization of Agent Behavior through Explicit Representation of Context

Independent Learning in Constrained Markov Potential Games

Posterior Sampling for Multi-Agent Reinforcement Learning: Solving Extensive Games with Imperfect Information

A Black-box Approach for Non-stationary Multi-agent Reinforcement Learning

Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation

Learning to Play General-Sum Games against Multiple Boundedly Rational Agents

On the Convergence of No-Regret Learning Dynamics in Time-Varying Games

Neural Auto-Curricula in Two-Player Zero-Sum Games

Neural Auto-Curricula

Decentralized Optimal Tracking Control for Large-scale Multi-Agent Systems under Complex Environment: A Constrained Mean Field Game with Reinforcement Learning Approach

Breaking the Curse of Multiagency in Robust Multi-Agent Reinforcement Learning

Online Learning under Adversarial Nonlinear Constraints