Abstract: This paper studies a class of multi-agent reinforcement learning (MARL) problems where the reward that an agent receives depends on the states of other agents, but the next state only depends on the agent's own current state and action. We name it REC-MARL standing for REward-Coupled Multi-Agent Reinforcement Learning. REC-MARL has a range of important applications such as real-time access control and distributed power control in wireless networks. This paper presents a distributed policy gradient algorithm for REC-MARL. The proposed algorithm is distributed in two aspects: (i) the learned policy is a distributed policy that maps a local state of an agent to its local action and (ii) the learning/training is distributed, during which each agent updates its policy based on its own and neighbors' information. The learned algorithm achieves a stationary policy and its iterative complexity bounds depend on the dimension of local states and actions. The experimental results of our algorithm for the real-time access control and power control in wireless networks show that our policy significantly outperforms the state-of-the-art algorithms and well-known benchmarks.
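The REC-MARL structure described in the abstract can be illustrated with a minimal Python sketch. Everything in it is a hypothetical toy chosen for illustration and not the paper's model: the buffer-style local state, the names `local_transition` and `coupled_reward`, the collision rule, and the three-agent line topology. The point is only that the transition function reads an agent's own state and action, while the reward function also reads its neighbors' information.

```python
def local_transition(state, action, arrival, capacity=3):
    """Next local state depends only on the agent's own state, action, and arrival.

    Hypothetical toy: the state is the number of buffered packets, and
    action 1 sends one packet if the buffer is non-empty.
    """
    sent = 1 if (action == 1 and state > 0) else 0
    return min(state - sent + arrival, capacity)


def coupled_reward(i, states, actions, neighbors):
    """Reward of agent i depends on its neighbors' states and actions.

    Hypothetical collision rule: a transmission earns 1.0 only if no
    neighbor transmits in the same step.
    """
    def transmits(j):
        return actions[j] == 1 and states[j] > 0

    if not transmits(i):
        return 0.0
    return 0.0 if any(transmits(j) for j in neighbors[i]) else 1.0


# Three agents on a line: 0 - 1 - 2 (assumed topology).
neighbors = {0: [1], 1: [0, 2], 2: [1]}
states = [2, 1, 0]
actions = [1, 0, 1]

rewards = [coupled_reward(i, states, actions, neighbors) for i in range(3)]
next_states = [local_transition(states[i], actions[i], arrival=0) for i in range(3)]
```

In this toy step, agent 0 transmits without interference, agent 1 stays idle, and agent 2 has nothing to send, so only agent 0 collects a reward while every buffer evolves independently of the other agents.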

What problem does this paper attempt to address?

This paper aims to develop a scalable and sample-efficient distributed policy gradient algorithm for multi-agent systems. Specifically, it focuses on a special class of multi-agent reinforcement learning (MARL) problems, namely reward-coupled multi-agent reinforcement learning (REC-MARL). In such problems, the reward an agent receives depends on the states and actions of other agents, but its next state depends only on its own current state and action. REC-MARL is relevant to applications such as real-time access control and distributed power control in wireless networks. The main contributions of the paper are:

1. **Perfect Decomposition of Value Function and Policy Gradient**: Lemma 1 and Lemma 2 prove that the global value function and policy gradient decompose into sums of local value functions and local policy gradients. This significantly reduces the complexity of the value function and provides a theoretical basis for the distributed multi-agent policy gradient algorithm.
2. **Regularized Distributed Multi-Agent Policy Gradient Algorithm Based on Temporal Difference (TD) Learning (TD-RDAC)**: The paper proposes the TD-RDAC algorithm and proves in Theorem 2 that it achieves local convergence at a rate of \(\tilde{O}\left(\frac{N S_{\max} A_{\max}}{(1 - \gamma)^4 c} \log T/T\right)\), where \(N\) is the number of agents, \(S_{\max}\) and \(A_{\max}\) are the maximum sizes of the local state space and action space respectively, \(\gamma\) is the discount factor, and \(T\) is the number of iterations.
3. **Verification in Practical Applications**: TD-RDAC is applied to real-time access control and power control problems in wireless networks. The experimental results show that it significantly outperforms existing state-of-the-art algorithms and benchmark algorithms.

Through these contributions, the paper advances the theory of multi-agent reinforcement learning and demonstrates the algorithm's effectiveness in practical applications.
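The value-function decomposition in the first contribution can be checked numerically on a toy instance. The sketch below is an illustration of the idea under stated assumptions, not the paper's proof or the TD-RDAC algorithm. It assumes three agents with two local states each, fixed local policies already folded into hypothetical transition matrices `P`, and hypothetical local rewards `R` whose scopes `SCOPE` list the agents each reward reads. It uses plain iterative policy evaluation rather than TD learning.

```python
import itertools

GAMMA = 0.9

# Hypothetical local transition matrices P[i][s][s'] under fixed local policies.
P = [
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.5, 0.5], [0.2, 0.8]],
    [[0.9, 0.1], [0.3, 0.7]],
]

# Hypothetical local rewards; agent i's reward reads only the states in SCOPE[i].
SCOPE = [(0, 1), (1, 2), (2,)]
R = [
    {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.5, (1, 1): 2.0},
    {(0, 0): 0.0, (0, 1): 1.5, (1, 0): 1.0, (1, 1): 0.2},
    {(0,): 0.3, (1,): 1.0},
]


def evaluate(agents, reward, iters=500):
    """Iterative policy evaluation on the product chain of the given agents."""
    states = list(itertools.product([0, 1], repeat=len(agents)))
    v = {s: 0.0 for s in states}
    for _ in range(iters):
        new_v = {}
        for s in states:
            expected_next = 0.0
            for s2 in states:
                prob = 1.0
                # Agents move independently, so the joint probability factorizes.
                for k, i in enumerate(agents):
                    prob *= P[i][s[k]][s2[k]]
                expected_next += prob * v[s2]
            new_v[s] = reward(s) + GAMMA * expected_next
        v = new_v
    return v


def global_reward(s):
    """Global reward is the sum of the local rewards."""
    return sum(R[i][tuple(s[j] for j in SCOPE[i])] for i in range(3))


# Global value over all 8 joint states.
v_global = evaluate((0, 1, 2), global_reward)

# Local values, each computed only over the agents in its reward scope.
v_local = [evaluate(SCOPE[i], lambda s, i=i: R[i][s]) for i in range(3)]

# Largest difference between the global value and the sum of local values.
max_gap = max(
    abs(v_global[s] - sum(v_local[i][tuple(s[j] for j in SCOPE[i])] for i in range(3)))
    for s in v_global
)
```

Here `max_gap` should be at the level of floating-point error. Each local value is computed on a state space of at most four states instead of eight, which hints at why the decomposition keeps the complexity tied to local rather than global dimensions.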

Scalable and Sample Efficient Distributed Policy Gradient Algorithms in Multi-Agent Networked Systems

Observer-Based Multiagent Deep Reinforcement Learning: A Fully Distributed Training Scheme

Multi-agent Deep Reinforcement Learning Algorithm for Distributed Economic Dispatch in Smart Grid

Multiagent Reinforcement Learning for Strictly Constrained Tasks Based on Reward Recorder

Target-Value-Competition-Based Multi-Agent Deep Reinforcement Learning Algorithm for Distributed Nonconvex Economic Dispatch

Communication-Efficient Policy Gradient Methods for Distributed Reinforcement Learning

Scalable Model-based Policy Optimization for Decentralized Networked Systems

Scalable Multi-Agent Reinforcement Learning for Networked Systems with Average Reward

An off-policy multi-agent stochastic policy gradient algorithm for cooperative continuous control

Multi-Agent Reinforcement Learning in Stochastic Networked Systems

Multi-Agent Reinforcement Learning With Decentralized Distribution Correction

A Collaborative Multiagent Reinforcement Learning Method Based on Policy Gradient Potential

Decentralized Multi-Agent Reinforcement Learning: An Off-Policy Method

Mean-Field Multi-Agent Reinforcement Learning: A Decentralized Network Approach

Multi-Agent Reinforcement Learning in Time-varying Networked Systems

Cooperative Multi-Agent Reinforcement Learning with Partial Observations

Optimal Exploration Algorithm of Multi-Agent Reinforcement Learning Methods (Student Abstract)

PowerNet: Multi-agent Deep Reinforcement Learning for Scalable Powergrid Control

A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning
