Abstract: Existing multi-agent PPO algorithms lack compatibility with different types of parameter sharing when extending the theoretical guarantee of PPO to cooperative multi-agent reinforcement learning (MARL). In this paper, we propose a novel and versatile multi-agent PPO algorithm for cooperative MARL to overcome this limitation. Our approach is achieved upon the proposed full-pipeline paradigm, which establishes multiple parallel optimization pipelines by employing various equivalent decompositions of the advantage function. This procedure successfully formulates the interconnections among agents in a more general manner, i.e., the interconnections among pipelines, making it compatible with diverse types of parameter sharing. We provide a solid theoretical foundation for policy improvement and subsequently develop a practical algorithm called Full-Pipeline PPO (FP3O) by several approximations. Empirical evaluations on Multi-Agent MuJoCo and StarCraftII tasks demonstrate that FP3O outperforms other strong baselines and exhibits remarkable versatility across various parameter-sharing configurations.

What problem does this paper attempt to address?

The main problem this paper addresses is that existing Proximal Policy Optimization (PPO) algorithms for cooperative multi-agent reinforcement learning (MARL) lack compatibility with different types of parameter sharing. Specifically:

1. **Limitations of existing PPO algorithms in multi-agent environments**:
   - When single-agent PPO is extended to the cooperative multi-agent setting, existing algorithms do not adapt well to different parameter-sharing methods (full parameter sharing, partial parameter sharing, and non-parameter sharing). This leads to unstable performance across tasks and network types.
   - Although existing multi-agent PPO algorithms (such as IPPO, MAPPO, and CoPPO) perform well under certain specific conditions, their optimization processes rely on strict training procedures, which limits their generality and flexibility.
2. **The importance of parameter sharing**:
   - Parameter sharing is an important topic in multi-agent reinforcement learning because it has a significant impact on training efficiency, knowledge transfer, and overall algorithm performance.
   - Different task scenarios call for different parameter-sharing methods: full parameter sharing suits scenarios with homogeneous agents or a large number of agents; non-parameter sharing gives each agent an independent policy network, which suits heterogeneous agents; partial parameter sharing balances diversity and sharing in large-scale heterogeneous-agent scenarios.
3. **Lack of theoretical guarantees**:
   - When existing PPO is extended to the multi-agent setting, it lacks theoretical guarantees of monotonic improvement, especially across different types of parameter sharing. This makes it difficult to predict how the algorithm will perform under different tasks and network configurations.
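The three parameter-sharing configurations described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the helper `build_policy_params`, the `groups` argument, and the linear "policy" weights are all hypothetical names chosen for illustration. The only idea it shows is that agents assigned to the same group hold the very same parameter array, so an update made through one agent is seen by every agent in that group.

```python
import numpy as np


def build_policy_params(n_agents, groups, obs_dim=4, act_dim=2, seed=0):
    """Return one weight matrix per agent (hypothetical helper).

    `groups[i]` is the group id of agent i. Agents with the same group id
    receive the same array object, i.e. they share parameters.
    """
    if len(groups) != n_agents:
        raise ValueError("groups must have one entry per agent")
    rng = np.random.default_rng(seed)
    shared = {}  # group id -> weight matrix owned by that group
    params = []
    for group_id in groups:
        if group_id not in shared:
            shared[group_id] = rng.normal(scale=0.1, size=(obs_dim, act_dim))
        params.append(shared[group_id])
    return params


def count_distinct(params):
    """Number of distinct parameter sets behind a list of per-agent parameters."""
    return len({id(p) for p in params})


n = 4
full_sharing = build_policy_params(n, groups=[0, 0, 0, 0])     # one network for all agents
partial_sharing = build_policy_params(n, groups=[0, 0, 1, 1])  # one network per agent type
no_sharing = build_policy_params(n, groups=[0, 1, 2, 3])       # one network per agent

print(count_distinct(full_sharing), count_distinct(partial_sharing), count_distinct(no_sharing))
```

Under full sharing, a gradient step for any one agent changes every agent's policy at once. This coupling is the reason an update scheme designed for independent networks may not carry over unchanged to shared ones.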
To solve these problems, the paper proposes a new multi-agent PPO algorithm, FP3O (Full-Pipeline PPO). By introducing the full-pipeline paradigm, it aims to overcome the limitations of existing algorithms, achieve broad compatibility with different types of parameter sharing, and provide a solid theoretical basis that ensures monotonic improvement of the policy.

### Formula Summary

- **Advantage Function Decomposition**:
  \[ A_\pi(s, a) = \sum_{m=1}^{n} A^{i_m}_\pi\left(s, a_{i_{1:m-1}}, a_{i_m}\right) \]
- **Lower Bound of Policy Improvement**:
  \[ J(\tilde{\pi}) \geq J(\pi) + \sum_{m=1}^{n} M^{i_m}_\pi\left(\tilde{\pi}_{i_{1:m-1}}, \tilde{\pi}_{i_m}\right) \]
  where
  \[ M^{i_{1:m}}_\pi\left(\tilde{\pi}_{j_{1:k}}, \tilde{\pi}_{i_{1:m}}\right) = \mathbb{E}_{s\sim\rho_\pi,\, a\sim\tilde{\pi}}\left[A^{i_{1:m}}_\pi\left(s, a_{j_{1:k}}, a_{i_{1:m}}\right)\right] - \sum_{i\in i_{1:m}} C\, D_{\text{KL}}(\pi_i \,\|\, \tilde{\pi}_i) \]
- **Single-Pipeline Optimization**:
  \[ J(\tilde{\pi}) \geq J(\pi) + M^{i_p}_\pi\left(\tilde{\pi}_\emptyset, \tilde{\pi}_{i_p}\right) + M^{-i_p}_\pi\left(\tilde{\pi}_{i_p}, \tilde{\pi}_{-i_p}\right) \]
- **Full-Pipeline Paradigm**:
  \[ \begin{aligned} & J(\pi) + M^{i_1}_\pi\left(\tilde{\pi}_\emptyset, \tilde{\pi}_{i_1}\right) + M^{-i_1}_\pi\left(\tilde{\pi}_{i_1}, \tilde{\pi}_{-i_1}\right) \\ & \qquad\vdots \\ & J(\pi) + M^{i_n}_\pi\left(\tilde{\pi}_\emptyset, \tilde{\pi}_{i_n}\right) + M^{-i_n}_\pi\left(\tilde{\pi}_{i_n}, \tilde{\pi}_{-i_n}\right) \end{aligned} \]
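The advantage function decomposition listed in the formula summary can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's code. It uses two agents, a single fixed state, and a random joint Q table. It also assumes the usual multi-agent advantage definitions, which the summary above does not spell out: \(A^{i_1}_\pi(s, a_1) = Q^{i_1}_\pi(s, a_1) - V_\pi(s)\) and \(A^{i_2}_\pi(s, a_1, a_2) = Q_\pi(s, a_1, a_2) - Q^{i_1}_\pi(s, a_1)\), where \(Q^{i_1}_\pi\) averages the joint Q value over agent 2's policy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a1, n_a2 = 3, 4                   # action counts for agents i_1 and i_2 (arbitrary)
q = rng.normal(size=(n_a1, n_a2))   # joint Q(s, a1, a2) at one fixed state s
pi1 = rng.dirichlet(np.ones(n_a1))  # agent i_1's action distribution at s
pi2 = rng.dirichlet(np.ones(n_a2))  # agent i_2's action distribution at s

v = pi1 @ q @ pi2                   # V(s): expectation of Q over both agents
q1 = q @ pi2                        # Q^{i_1}(s, a1): expectation over agent i_2 only

joint_adv = q - v                   # A(s, a1, a2)
adv1 = q1 - v                       # A^{i_1}(s, a1)         (assumed definition)
adv2 = q - q1[:, None]              # A^{i_2}(s, a1, a2)     (assumed definition)

# Sum of the per-agent terms, broadcast back to the joint action space.
decomposed = adv1[:, None] + adv2
max_gap = float(np.max(np.abs(joint_adv - decomposed)))
print("largest decomposition error:", max_gap)
```

Swapping the roles of the two agents gives a different but equally valid decomposition of the same joint advantage. The abstract's "various equivalent decompositions" refers to this freedom in choosing the agent ordering.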

FP3O: Enabling Proximal Policy Optimization in Multi-Agent Cooperation with Parameter-Sharing Versatility

Coordinated Proximal Policy Optimization

The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games

The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games

JointPPO: Diving Deeper into the Effectiveness of PPO in Multi-Agent Reinforcement Learning

Policy Regularization via Noisy Advantage Values for Cooperative Multi-agent Actor-Critic methods

Multi-Path Policy Optimization

Meta Proximal Policy Optimization for Cooperative Multi-Agent Continuous Control

An Improved PPO for Multiple Unmanned Aerial Vehicles

Beyond the Boundaries of Proximal Policy Optimization

Towards Global Optimality in Cooperative MARL with Sequential Transformation

Authentic Boundary Proximal Policy Optimization

Truly Proximal Policy Optimization

Exploration in policy optimization through multiple paths

Communication-Efficient Cooperative Multi-Agent PPO via Regulated Segment Mixture in Internet of Vehicles

Learning Effective Communication for Cooperative Pursuit with Multi-Agent Reinforcement Learning

Assigning Credit with Partial Reward Decoupling in Multi-Agent Proximal Policy Optimization

B2MAPO: A Batch-by-Batch Multi-Agent Policy Optimization to Balance Performance and Efficiency

Optimal Exploration Algorithm of Multi-Agent Reinforcement Learning Methods (Student Abstract)

Decentralized Policy Optimization

Multiple-UAV Reinforcement Learning Algorithm Based on Improved PPO in Ray Framework