Scalable and Sample Efficient Distributed Policy Gradient Algorithms in Multi-Agent Networked Systems

Xin Liu, Honghao Wei, Lei Ying
DOI: https://doi.org/10.48550/arXiv.2212.06357
2023-05-15
Abstract: This paper studies a class of multi-agent reinforcement learning (MARL) problems where the reward that an agent receives depends on the states of other agents, but the next state only depends on the agent's own current state and action. We name it REC-MARL standing for REward-Coupled Multi-Agent Reinforcement Learning. REC-MARL has a range of important applications such as real-time access control and distributed power control in wireless networks. This paper presents a distributed policy gradient algorithm for REC-MARL. The proposed algorithm is distributed in two aspects: (i) the learned policy is a distributed policy that maps a local state of an agent to its local action and (ii) the learning/training is distributed, during which each agent updates its policy based on its own and neighbors' information. The learned algorithm achieves a stationary policy and its iterative complexity bounds depend on the dimension of local states and actions. The experimental results of our algorithm for the real-time access control and power control in wireless networks show that our policy significantly outperforms the state-of-the-art algorithms and well-known benchmarks.
Multiagent Systems, Machine Learning
What problem does this paper attempt to address?
The paper aims to design a scalable and sample-efficient distributed policy gradient algorithm for multi-agent systems. It focuses on a special class of multi-agent reinforcement learning (MARL) problems called reward-coupled multi-agent reinforcement learning (REC-MARL). In these problems, the reward an agent receives depends on the states and actions of other agents, but the agent's next state depends only on its own current state and action. REC-MARL covers applications such as real-time access control and distributed power control in wireless networks.
The main contributions of the paper include:
1. **Perfect Decomposition of Value Function and Policy Gradient**: Lemma 1 and Lemma 2 prove that the global value function and the global policy gradient decompose into sums of local value functions and local policy gradients. This significantly reduces the complexity of the value function and provides the theoretical basis for the distributed multi-agent policy gradient algorithm.
2. **Regularized Distributed Multi-Agent Policy Gradient Algorithm Based on Temporal Difference (TD) Learning (TD-RDAC)**: The paper proposes the TD-RDAC algorithm, and Theorem 2 proves that it achieves local convergence at a rate of \(\tilde{O}\left(\frac{N S_{\max} A_{\max}}{(1 - \gamma)^4 c} \log T/T\right)\), where \(N\) is the number of agents, \(S_{\max}\) and \(A_{\max}\) are the maximum sizes of the local state space and local action space respectively, \(\gamma\) is the discount factor, and \(T\) is the number of iterations.
3. **Verification in Practical Applications**: TD-RDAC is applied to real-time access control and power control problems in wireless networks. The experimental results show that it significantly outperforms existing state-of-the-art algorithms and benchmark algorithms.
Through these contributions, the paper advances the theory of multi-agent reinforcement learning and demonstrates the algorithm's effectiveness in practical wireless-network applications.
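The value-function decomposition described in the first contribution can be checked numerically on a toy problem. The following is a minimal Python sketch under stated assumptions, not the paper's implementation: three agents on a line graph, randomly drawn local transition kernels and local policies, local rewards that depend only on neighbors' states (as in the abstract), and a global reward taken to be the sum of local rewards. All names (`neighbors`, `evaluate`, `r_local`, and so on) are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, S, A, gamma = 3, 2, 2, 0.9  # agents, local states, local actions, discount

# Hypothetical line graph: each agent's reward sees itself and adjacent agents.
neighbors = {0: (0, 1), 1: (0, 1, 2), 2: (1, 2)}

# Local dynamics P[n, s, a, s'] and local policies pi[n, s, a] (random, assumed).
P = rng.dirichlet(np.ones(S), size=(N, S, A))
pi = rng.dirichlet(np.ones(A), size=(N, S))

# Under a local policy, each agent's state follows its own Markov chain M[n, s, s'].
M = np.einsum("nsa,nsat->nst", pi, P)

# Local reward tables, indexed by the joint state of the agent's neighborhood.
r_local = {n: rng.random(S ** len(neighbors[n])) for n in range(N)}


def index(states):
    """Flatten a tuple of local states (first agent is most significant)."""
    i = 0
    for s in states:
        i = i * S + s
    return i


def evaluate(agents, reward_vec):
    """Exact discounted value on the joint chain of the given agents."""
    K = np.ones((1, 1))
    for n in agents:
        K = np.kron(K, M[n])  # independent chains -> Kronecker product
    return np.linalg.solve(np.eye(K.shape[0]) - gamma * K, reward_vec)


# Local value functions: each one is solved on its neighborhood only.
V_local = {n: evaluate(neighbors[n], r_local[n]) for n in range(N)}

# Global value function: solved on the full joint state space.
global_states = list(itertools.product(range(S), repeat=N))
r_global = np.array([
    sum(r_local[n][index([s[m] for m in neighbors[n]])] for n in range(N))
    for s in global_states
])
V_global = evaluate(range(N), r_global)

# Sum of local values, looked up at each global state's neighborhood restriction.
V_sum = np.array([
    sum(V_local[n][index([s[m] for m in neighbors[n]])] for n in range(N))
    for s in global_states
])

max_gap = float(np.max(np.abs(V_global - V_sum)))
print("max |V_global - sum of local values| =", max_gap)
```

The printed gap should be at the level of floating-point round-off, because each local reward process is driven only by the independent chains in its own neighborhood. In this toy case the middle agent's neighborhood is the whole network, so the saving is small; the benefit of solving per-neighborhood problems grows when neighborhoods stay small as the number of agents increases.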