Decentralized Multi-Agent Reinforcement Learning for Continuous-Space Stochastic Games

Awni Altabaa, Bora Yongacoglu, Serdar Yüksel
DOI: https://doi.org/10.23919/ACC55779.2023.10155828
2023-03-16
Abstract: Stochastic games are a popular framework for studying multi-agent reinforcement learning (MARL). Recent advances in MARL have focused primarily on games with finitely many states. In this work, we study multi-agent learning in stochastic games with general state spaces and an information structure in which agents do not observe each other's actions. In this context, we propose a decentralized MARL algorithm and we prove the near-optimality of its policy updates. Furthermore, we study the global policy-updating dynamics for a general class of best-reply based algorithms and derive a closed-form characterization of convergence probabilities over the joint policy space.
Machine Learning; Computer Science and Game Theory
What problem does this paper attempt to address?
The problem this paper addresses is how to achieve decentralized multi-agent reinforcement learning (MARL) in continuous-space stochastic games with standard Borel state spaces. Specifically, the paper designs a decentralized MARL algorithm for an information structure in which agents cannot observe each other's actions, and proves the near-optimality of its policy updates. In addition, it studies the global policy-update dynamics of a class of best-reply-based algorithms and derives a closed-form expression for the convergence probabilities over the joint policy space.

### Main problem decomposition:

1. **Policy updates in a decentralized learning environment**:
   - Because agents cannot observe each other's actions, each agent must update its policy independently, using only its own observations.
   - The paper quantizes the state space, mapping the continuous state space onto a finite set of states, so that Q-learning can be applied.
2. **Proof of near-optimality**:
   - Under the assumptions of weakly continuous transition kernels and continuous, bounded cost functions, the paper proves that the policy updates of the proposed decentralized MARL algorithm are near-optimal.
   - The algorithm uses exploration phases: agents update their policies only at predetermined times and keep them fixed between these times.
3. **Analysis of global policy-update dynamics**:
   - The paper models the global policy-update dynamics of a class of best-reply-based algorithms as a Markov chain.
   - It derives a closed-form expression for the probability of converging to an equilibrium under a weakly acyclic structure.
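A minimal Python sketch of one exploration phase for a single agent may help make items 1 and 2 concrete: a uniform quantizer `q`, the quantized Q-learning update, and a δ-best-reply check at the end of the phase. Everything specific here is a hypothetical assumption rather than the authors' implementation: the scalar state in [0, 1], the two actions, the opponent that plays uniformly at random and is never observed, the step size 1/(number of visits), the exploration probability `rho`, and all function names. Because the best-reply condition is checked separately at each quantized state, the δ-best-reply set is stored as one list of admissible actions per state.

```python
import random
from collections import defaultdict


def make_uniform_quantizer(low, high, num_bins):
    """Return q: [low, high] -> {0, ..., num_bins - 1}, the index of the bin containing x."""
    width = (high - low) / num_bins

    def q(x):
        idx = int((x - low) // width)
        return min(max(idx, 0), num_bins - 1)  # clip so x == high (or outliers) land in a valid bin

    return q


def q_update(Q, visits, y, u, cost, y_next, beta, actions):
    """One quantized Q-learning step on the pair (y, u), using only the agent's own observations."""
    visits[(y, u)] += 1
    alpha = 1.0 / visits[(y, u)]  # assumed step size: inverse visit count of (y, u)
    target = cost + beta * min(Q[(y_next, v)] for v in actions)
    Q[(y, u)] = (1 - alpha) * Q[(y, u)] + alpha * target


def delta_best_reply_actions(Q, states, actions, delta):
    """Per quantized state y, the actions whose Q-value is within delta of min_u Q(y, u)."""
    br = {}
    for y in states:
        best = min(Q[(y, v)] for v in actions)
        br[y] = [u for u in actions if Q[(y, u)] <= best + delta]
    return br


def run_exploration_phase(policy, q, actions, step, x0, T, beta, rho, rng):
    """Learn Q over one phase of T steps while the baseline `policy` stays frozen."""
    Q, visits = defaultdict(float), defaultdict(int)
    x = x0
    for _ in range(T):
        y = q(x)
        # Follow the frozen baseline policy, exploring uniformly with probability rho.
        u = rng.choice(actions) if rng.random() < rho else policy[y]
        cost, x_next = step(x, u, rng)  # other agents' actions stay hidden inside `step`
        q_update(Q, visits, y, u, cost, q(x_next), beta, actions)
        x = x_next
    return Q


def toy_step(x, u, rng):
    """Hypothetical dynamics on [0, 1]; an unobserved opponent plays uniformly at random."""
    v = rng.choice([0, 1])
    x_next = min(max(x + 0.1 * (u + v - 1) + rng.gauss(0.0, 0.05), 0.0), 1.0)
    cost = (x - 0.5) ** 2 + 0.1 * u
    return cost, x_next


rng = random.Random(0)
num_bins = 5
q = make_uniform_quantizer(0.0, 1.0, num_bins)
states, actions = list(range(num_bins)), [0, 1]
policy = {y: 1 for y in states}  # current baseline policy: always play action 1
Q = run_exploration_phase(policy, q, actions, toy_step, 0.5, 5000, 0.9, 0.3, rng)
br = delta_best_reply_actions(Q, states, actions, delta=0.05)
if all(policy[y] in br[y] for y in states):
    new_policy = policy  # already a delta-best reply: keep it
else:
    # Otherwise switch to a policy drawn uniformly from the delta-best-reply set.
    new_policy = {y: rng.choice(br[y]) for y in states}
```

The final switching rule (keep the current policy if it is already a δ-best reply, otherwise draw one uniformly from the δ-best-reply set) is one simple choice; the paper's exact update rule, for example whether it adds inertia, is not specified in this summary.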
### Formula representation:

- Quantization mapping of the state space:
  \[ q: X \to Y \]
  where \( X \) is the original state space and \( Y=\{y_1, y_2,\dots,y_M\} \) is the quantized finite state space.
- Quantized Q-learning update rule for agent \( i \):
  \[ Q_t^{(i)}(q(x), u)=\bigl(1-\alpha_t(q(x), u)\bigr)\,Q_{t-1}^{(i)}(q(x), u)+\alpha_t(q(x), u)\left(c(x, u)+\beta\min_{v\in U_i}Q_{t-1}^{(i)}(q(X_{t+1}), v)\right) \]
  where \( x = X_t \) is the current state, \( X_{t+1} \) the next state, \( \alpha_t \) the step size, and \( \beta \in (0,1) \) the discount factor.
- \( \delta \)-best-reply set:
  \[ \text{BR}_i^\delta(Q_t)=\left\{\hat{\gamma}\in\hat{\Pi}_i^q:\ Q_t(y,\hat{\gamma}(y))\leq\min_{u\in U_i}Q_t(y, u)+\delta_i,\ \forall y\in Y_i^q\right\} \]

### Summary:

The main contribution of the paper is a decentralized multi-agent reinforcement learning algorithm for stochastic games with continuous state spaces, together with a proof of the near-optimality of its policy updates. In addition, the paper analyzes the global policy-update dynamics of best-reply-based algorithms and gives a closed-form expression for the probability of converging to an equilibrium. These results provide a theoretical basis for understanding and optimizing decentralized multi-agent systems.
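One standard way to obtain closed-form convergence probabilities over a finite joint policy space is the absorption-probability formula for an absorbing Markov chain, \( B = (I - T)^{-1} R \), where \( T \) holds the transition probabilities among non-equilibrium joint policies and \( R \) those into equilibria. The following sketch assumes that equilibria are absorbing (every agent already playing a best reply keeps its policy) and that some equilibrium is reachable from every other joint policy, so that \( I - T \) is invertible. It illustrates this textbook computation in exact rational arithmetic; it is not necessarily the paper's derivation, and the four-policy transition matrix is invented for illustration.

```python
from fractions import Fraction


def absorption_probabilities(P, absorbing):
    """Solve (I - T) B = R exactly for a finite absorbing Markov chain.

    P: square transition matrix over joint policies (entries int or Fraction).
    absorbing: indices of absorbing states (here: equilibrium joint policies).
    Returns {transient state: {absorbing state: probability of ending there}}.
    Assumes absorption is reachable from every transient state (else I - T is singular).
    """
    n = len(P)
    transient = [i for i in range(n) if i not in absorbing]
    k = len(transient)
    # Augmented matrix [I - T | R], one right-hand column per absorbing state.
    A = [
        [Fraction((1 if r == c else 0) - P[transient[r]][transient[c]]) for c in range(k)]
        + [Fraction(P[transient[r]][a]) for a in absorbing]
        for r in range(k)
    ]
    # Gauss-Jordan elimination, so the left block becomes I and the right block becomes B.
    for col in range(k):
        pivot = next(r for r in range(col, k) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        inv = Fraction(1) / A[col][col]
        A[col] = [v * inv for v in A[col]]
        for r in range(k):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return {transient[r]: {a: A[r][k + j] for j, a in enumerate(absorbing)} for r in range(k)}


F = Fraction
P = [
    [F(1), F(0), F(0), F(0)],           # joint policy 0: equilibrium (absorbing)
    [F(1, 2), F(1, 4), F(1, 4), F(0)],  # joint policy 1: not an equilibrium
    [F(0), F(1, 4), F(1, 4), F(1, 2)],  # joint policy 2: not an equilibrium
    [F(0), F(0), F(0), F(1)],           # joint policy 3: equilibrium (absorbing)
]
probs = absorption_probabilities(P, absorbing=[0, 3])
print(probs[1])  # {0: Fraction(3, 4), 3: Fraction(1, 4)}
```

Exact fractions are used so that the resulting probabilities can be read off in closed form rather than as floating-point approximations.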