Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation

Qiwen Cui, Kaiqing Zhang, Simon S. Du
2023-06-22
Abstract: We propose a new model, independent linear Markov game, for multi-agent reinforcement learning with a large state space and a large number of agents. This is a class of Markov games with independent linear function approximation, where each agent has its own function approximation for the state-action value functions that are marginalized by other players' policies. We design new algorithms for learning the Markov coarse correlated equilibria (CCE) and Markov correlated equilibria (CE) with sample complexity bounds that only scale polynomially with each agent's own function class complexity, thus breaking the curse of multiagents. In contrast, existing works for Markov games with function approximation have sample complexity bounds that scale with the size of the joint action space when specialized to the canonical tabular Markov game setting, which is exponentially large in the number of agents. Our algorithms rely on two key technical innovations: (1) utilizing policy replay to tackle non-stationarity incurred by multiple agents and the use of function approximation; (2) separating learning Markov equilibria and exploration in the Markov games, which allows us to use the full-information no-regret learning oracle instead of the stronger bandit-feedback no-regret learning oracle used in the tabular setting. Furthermore, we propose an iterative-best-response type algorithm that can learn pure Markov Nash equilibria in independent linear Markov potential games. In the tabular case, by adapting the policy replay mechanism for independent linear Markov games, we propose an algorithm with $\widetilde{O}(\epsilon^{-2})$ sample complexity to learn Markov CCE, which improves the state-of-the-art result $\widetilde{O}(\epsilon^{-3})$ in Daskalakis et al. (2022), where $\epsilon$ is the desired accuracy, and also significantly improves other problem parameters.
Machine Learning, Artificial Intelligence, Computer Science and Game Theory, Multiagent Systems
What problem does this paper attempt to address?
This paper addresses how to conduct reinforcement learning efficiently in multi-agent systems when both the state space and the number of agents are large. It proposes a new model, the independent linear Markov game, which handles many agents through independent linear function approximation. Unlike existing global function approximation methods, the model gives each agent its own function approximation for its state-action value function, and these value functions are marginalized over the other agents' policies. The paper designs new algorithms for learning Markov coarse correlated equilibria (CCE) and Markov correlated equilibria (CE) whose sample complexity grows only polynomially with the complexity of each agent's own function class, thereby breaking the curse of multiagents.

### Main contributions

1. **Independent linear function approximation in multi-player general-sum Markov games**:
   - Proposes the independent linear Markov game, the first provably efficient model in multi-agent reinforcement learning that allows each agent to have its own independent function approximation.
   - Shows that the model captures several important instances, including tabular Markov games, linear Markov decision processes (MDPs), and congestion games.
   - Designs the first efficient algorithm in multi-agent reinforcement learning that simultaneously breaks the curse of multiagents and the curse of large state and action spaces, with sample complexity that depends only on the complexity of each agent's independent function class.

2. **Learning Nash equilibria in independent linear Markov potential games**:
   - Provides an algorithm for learning Markov Nash equilibria (NE) in independent linear Markov potential games.
   - The algorithm is based on a reduction from learning NE in independent linear Markov potential games to learning optimal policies in linear MDPs.
   - The result directly implies a provably efficient decentralized algorithm for learning NE in congestion games, with sample complexity better than the previous best result.

3. **Improving the sample complexity of tabular multi-player general-sum Markov games**:
   - Beyond the function approximation setting, the paper also designs an algorithm for tabular Markov games. Adapting the policy replay mechanism from independent linear Markov games significantly improves the sample complexity of learning Markov CCE.
   - The sample complexity is \(\tilde{O}(H^6 S^2 A_{\max} \epsilon^{-2})\), compared with the previous best result of \(\tilde{O}(H^{11} S^3 A_{\max} \epsilon^{-3})\).
   - Provides the first provably efficient algorithm for learning Markov CE, with a sample complexity of \(\tilde{O}(H^6 S^2 A_{\max}^2 \epsilon^{-2})\).

### Technical innovations

- **Policy replay to deal with non-stationarity**: Unlike experience replay, policy replay completely renews the dataset in each episode by collecting fresh data with the current policy set. This allows efficient exploration while adapting to the non-stationarity caused by multiple agents and function approximation.
- **Separating exploration and learning Markov equilibria**: In the self-play setting the other agents are not adversarial, so independent and identically distributed feedback can be sampled multiple times. This yields more accurate estimates than relying on a single bandit-feedback sample. As a result, any no-regret algorithm with full-information feedback is sufficient to drive the multi-agent reinforcement learning algorithm. This is a significantly weaker requirement than the adversarial bandit oracle used in all previous works that break the curse of multiagents.

### Related work

- **Tabular Markov games**: Covers prior work under both bandit feedback and full-information feedback, including convergence and individual-regret guarantees under different conditions.
- **Markov games with function approximation**: Covers how function approximation frameworks from single-agent reinforcement learning have been applied to multi-agent reinforcement learning.
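
To make the idea of separating exploration from equilibrium learning more concrete, here is a minimal Python sketch. It is not the paper's algorithm. It assumes a single-state matrix game with a fixed opponent policy, and it uses Hedge (multiplicative weights) as a stand-in for the full-information no-regret oracle. The function names `estimate_loss_vector`, `hedge_update`, and `run_hedge`, the loss matrix, the learning rate, and the sample counts are all hypothetical choices made for illustration. The point it shows is that repeated i.i.d. sampling against a non-adversarial opponent turns bandit observations into an estimated loss for every action, which is all a full-information learner needs.

```python
import numpy as np


def estimate_loss_vector(loss_matrix, opp_policy, n_samples, rng):
    """Estimate the agent's expected loss for every one of its own actions.

    Each own action is tried n_samples times against i.i.d. opponent actions
    drawn from opp_policy, and the observed losses are averaged. The result is
    a full loss vector rather than one bandit observation.
    """
    n_actions, n_opp_actions = loss_matrix.shape
    estimate = np.zeros(n_actions)
    for a in range(n_actions):
        opp_actions = rng.choice(n_opp_actions, size=n_samples, p=opp_policy)
        estimate[a] = loss_matrix[a, opp_actions].mean()
    return estimate


def hedge_update(policy, loss_vector, lr):
    """One Hedge (multiplicative weights) step with full-information feedback."""
    new_policy = policy * np.exp(-lr * loss_vector)
    return new_policy / new_policy.sum()


def run_hedge(loss_matrix, opp_policy, n_rounds=200, n_samples=50, lr=0.5, seed=0):
    """Run Hedge on estimated loss vectors, starting from the uniform policy."""
    rng = np.random.default_rng(seed)
    n_actions = loss_matrix.shape[0]
    policy = np.full(n_actions, 1.0 / n_actions)
    for _ in range(n_rounds):
        estimate = estimate_loss_vector(loss_matrix, opp_policy, n_samples, rng)
        policy = hedge_update(policy, estimate, lr)
    return policy


if __name__ == "__main__":
    # Hypothetical 3x2 loss matrix with entries in [0, 1]; rows are own actions.
    loss = np.array([[0.9, 0.8], [0.1, 0.2], [0.5, 0.5]])
    final_policy = run_hedge(loss, opp_policy=np.array([0.5, 0.5]))
    print(final_policy)  # mass concentrates on the row with the lowest expected loss
```

In this toy setting the learner never needs an importance-weighted bandit estimator, because averaging over repeated samples already supplies a loss estimate for each action. In the paper's setting the per-action estimates come from each agent's own function approximation rather than from enumerating actions as done here.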