Abstract: While Centralized Training with Decentralized Execution (CTDE) has become the prevailing paradigm in Multi-Agent Reinforcement Learning (MARL), it may not be suitable for scenarios in which agents can fully communicate and share observations with each other. Fully centralized methods, also known as Centralized Training with Centralized Execution (CTCE) methods, can fully utilize the observations of all agents by treating the entire system as a single agent. However, traditional CTCE methods suffer from scalability issues due to the exponential growth of the joint action space. To address these challenges, in this paper we propose JointPPO, a CTCE method that uses Proximal Policy Optimization (PPO) to directly optimize the joint policy of the multi-agent system. JointPPO decomposes the joint policy into conditional probabilities, transforming the decision-making process into a sequence generation task. A Transformer-based joint policy network is constructed and trained with a PPO loss tailored for the joint policy. JointPPO effectively handles a large joint action space and extends PPO to the multi-agent setting in a clear and concise manner. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) testbed demonstrate the superiority of JointPPO over strong baselines. Ablation experiments and analyses are conducted to explore the factors influencing JointPPO's performance.
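
The decomposition described in the abstract can be written out as a chain-rule factorization. The notation below is an assumption made for illustration (the abstract gives no symbols): $n$ agents, joint observation $\mathbf{o}$, joint action $\mathbf{a} = (a^1, \dots, a^n)$, and an arbitrary fixed ordering of the agents.

$$
\pi_\theta(\mathbf{a} \mid \mathbf{o}) \;=\; \prod_{i=1}^{n} \pi_\theta\!\left(a^{i} \mid \mathbf{o},\, a^{1}, \dots, a^{i-1}\right)
$$

Under this factorization, the probability ratio used by PPO for the joint policy becomes a product of per-agent conditional ratios. Plugging it into the standard clipped surrogate of single-agent PPO gives one plausible reading of a "PPO loss for the joint policy"; it is not necessarily the exact loss used in the paper:

$$
r_t(\theta) = \frac{\pi_\theta(\mathbf{a}_t \mid \mathbf{o}_t)}{\pi_{\theta_{\text{old}}}(\mathbf{a}_t \mid \mathbf{o}_t)}, \qquad
L^{\text{clip}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
$$

Here $\hat{A}_t$ is an advantage estimate and $\epsilon$ is the clipping range, both as in standard PPO.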

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper primarily aims to address two core issues in Multi-Agent Reinforcement Learning (MARL):

1. **Scalability of Centralized Training with Centralized Execution (CTCE) Methods**: Traditional CTCE methods scale poorly in practice because the joint action space grows exponentially with the number of agents.
2. **Limitations on Information Sharing in Existing Methods**: Centralized Training with Decentralized Execution (CTDE) methods can use all agents' observations during training but restrict information sharing among agents during execution, which leaves available information unused in scenarios where agents can communicate fully.

To tackle these challenges, the authors propose a new method called JointPPO. JointPPO adopts a fully centralized perspective by directly applying Proximal Policy Optimization (PPO) to the joint policy of the multi-agent system, so that the multi-agent problem can be treated much like a single-agent reinforcement learning problem. JointPPO also uses a Transformer model to handle the high-dimensional joint action space, which further improves its performance on complex tasks.

### Main Contributions

1. **Conditional Probability Decomposition of the Joint Policy**: Explicitly decomposes the joint policy of the multi-agent system into conditional probabilities, transforming the decision-making process into a sequence generation task.
2. **General Framework**: Proposes a general framework in which any sequence generation model can be used to solve MARL problems.
3. **Specific Implementation of JointPPO**: As an instance of this framework, JointPPO directly optimizes the joint policy and handles high-dimensional joint action spaces through a Transformer model.
4. **Experimental Validation**: Conducts extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) testbed, demonstrating the advantages of JointPPO over existing methods.
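
The scalability argument and the conditional decomposition summarized above can be illustrated with a small numeric sketch. This is not the authors' implementation: the function names, the toy probabilities, and the use of the standard single-agent PPO clipped surrogate are all assumptions made for illustration. It shows how a flat joint action space grows exponentially, how a joint log-probability is obtained by summing per-agent conditional log-probabilities, and how a clipped objective could be evaluated on the resulting joint ratio.

```python
import math


def joint_action_space_size(n_agents, n_actions):
    """Number of joint actions when each agent picks one of n_actions."""
    return n_actions ** n_agents


def joint_log_prob(conditional_probs):
    """Log-probability of a joint action under a chain-rule factorization.

    conditional_probs[i] is the (hypothetical) probability of agent i's
    chosen action given the joint observation and the actions of agents
    0..i-1. The joint probability is their product, so the logs add up.
    """
    return sum(math.log(p) for p in conditional_probs)


def clipped_surrogate(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate, evaluated on the joint ratio."""
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Toy example with three agents (all numbers are made up).
new_probs = [0.5, 0.4, 0.8]  # conditionals under the current policy
old_probs = [0.4, 0.4, 0.5]  # conditionals under the data-collecting policy

new_lp = joint_log_prob(new_probs)
old_lp = joint_log_prob(old_probs)
joint_ratio = math.exp(new_lp - old_lp)

objective_pos = clipped_surrogate(new_lp, old_lp, advantage=1.0)
objective_neg = clipped_surrogate(new_lp, old_lp, advantage=-1.0)

print("joint actions for 10 agents x 10 actions:", joint_action_space_size(10, 10))
print("joint ratio:", joint_ratio)
print("clipped objective (A=+1):", objective_pos)
print("clipped objective (A=-1):", objective_neg)
```

A flat policy head would need one output per joint action (`n_actions ** n_agents`), whereas a sequence model that emits one agent's action at a time only needs `n_actions` outputs per step, which is the motivation for treating decision-making as sequence generation.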

JointPPO: Diving Deeper into the Effectiveness of PPO in Multi-Agent Reinforcement Learning

The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games

FP3O: Enabling Proximal Policy Optimization in Multi-Agent Cooperation with Parameter-Sharing Versatility

Learning Effective Communication for Cooperative Pursuit with Multi-Agent Reinforcement Learning

Joint action loss for proximal policy optimization

An Improved PPO for Multiple Unmanned Aerial Vehicles

Coordinated Proximal Policy Optimization

Is Centralized Training with Decentralized Execution Framework Centralized Enough for MARL?

Proximal Policy Optimization Based Decentralized Networked Multi-Agent Reinforcement Learning

Towards Global Optimality in Cooperative MARL with Sequential Transformation

Decentralized Policy Optimization

Multiple-UAV Reinforcement Learning Algorithm Based on Improved PPO in Ray Framework

Assigning Credit with Partial Reward Decoupling in Multi-Agent Proximal Policy Optimization

Policy Regularization via Noisy Advantage Values for Cooperative Multi-agent Actor-Critic methods

Backpropagation Through Agents

Research on Multi-Agent Communication and Collaborative Decision-Making Based on Deep Reinforcement Learning

Heterogeneous Multi-Agent Reinforcement Learning for Zero-Shot Scalable Collaboration

Intelligent Decentralized Multiple Access Via Multi-Agent Deep Reinforcement Learning

Towards Global Optimality in Cooperative MARL with the Transformation And Distillation Framework

Multi-Agent Constrained Policy Optimisation