Abstract: Recent works have applied Proximal Policy Optimization (PPO) to multi-agent cooperative tasks, such as Independent PPO (IPPO) and vanilla Multi-Agent PPO (MAPPO), which has a centralized value function. However, previous literature shows that MAPPO may not perform as well as IPPO and the Fine-tuned QMIX on the StarCraft Multi-Agent Challenge (SMAC). MAPPO-Feature-Pruned (MAPPO-FP) improves the performance of MAPPO through carefully designed agent-specific features, which may limit the algorithm's general utility. By contrast, we find that MAPPO may face the problem of Policies Overfitting in Multi-agent Cooperation (POMAC), because the agents learn policies from sampled advantage values. POMAC may then lead to updating the multi-agent policies in a suboptimal direction and prevent the agents from exploring better trajectories. In this paper, to mitigate multi-agent policy overfitting, we propose a novel policy regularization method, which disturbs the advantage values via random Gaussian noise. The experimental results show that our method outperforms the Fine-tuned QMIX and MAPPO-FP, and achieves SOTA on SMAC without agent-specific features. We open-source the code at https://github.com/hijkzzz/noisy-mappo.

What problem does this paper attempt to address?

The paper addresses cooperative tasks in Multi-Agent Reinforcement Learning (MARL), in particular policy optimization under the Centralized Training with Decentralized Execution (CTDE) framework. The issues it addresses can be summarized as follows:

- **Problems with existing methods**: Existing methods such as Independent Proximal Policy Optimization (IPPO) and vanilla Multi-Agent PPO (MAPPO) perform poorly in some complex environments. In particular, MAPPO may encounter the so-called "Policies Overfitting in Multi-Agent Cooperation (POMAC)" problem, which leads to suboptimal policy update directions and limits agents' exploration of better trajectories.
- **POMAC problem**: In multi-agent cooperative tasks, agents learn policies from sampled advantage values. This sampling may introduce bias, which leads to overfitting during policy updates. When the number of agents is large, it is especially difficult to estimate the true gradient of each agent's contribution to the team from limited samples.
- **Objective of the paper**: The paper aims to propose a new policy regularization method that alleviates multi-agent policy overfitting and improves the performance of the algorithm.

To achieve these objectives, the paper proposes two policy regularization methods: Noisy-Advantage MAPPO (NA-MAPPO) and Noisy-Value MAPPO (NV-MAPPO). Both methods introduce random Gaussian noise to the advantage values to perturb the policy update process, thereby mitigating the overfitting problem. Experimental results show that these methods significantly improve performance on benchmarks such as the StarCraft Multi-Agent Challenge (SMAC), and in some scenarios they achieve State-Of-The-Art (SOTA) performance. Additionally, NV-MAPPO demonstrates good generalization ability in non-monotonic matrix games, which suggests an advantage in expressiveness.
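The idea of perturbing advantage values with Gaussian noise can be illustrated with a short sketch. The summary above does not give the exact formula, so the convex blend below, the function name `noisy_advantages`, and the mixing weight `alpha` are all assumptions made for illustration; this is not the authors' implementation, which is available in the linked repository.

```python
import random
from typing import List, Optional


def noisy_advantages(
    advantages: List[List[float]],
    alpha: float = 0.05,
    seed: Optional[int] = None,
) -> List[List[float]]:
    """Blend each per-agent advantage with standard Gaussian noise.

    advantages[t][i] is the sampled advantage for agent i at timestep t.
    alpha is a hypothetical mixing weight in [0, 1] (an assumption, not a
    value taken from the paper); alpha = 0 leaves the advantages unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = random.Random(seed)
    # Draw an independent noise sample for every (timestep, agent) entry,
    # so agents that share a team advantage receive different signals.
    return [
        [(1.0 - alpha) * a + alpha * rng.gauss(0.0, 1.0) for a in row]
        for row in advantages
    ]


# Two timesteps, three agents; every agent sees the same team advantage.
shared = [[0.8, 0.8, 0.8], [-0.3, -0.3, -0.3]]
perturbed = noisy_advantages(shared, alpha=0.1, seed=0)
```

In a PPO-style update, `perturbed` would stand in for the sampled advantages in the clipped surrogate objective. Because each agent's advantage is disturbed independently, the agents no longer all follow one sampled estimate exactly, which is the regularizing effect the summary describes.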

Policy Regularization via Noisy Advantage Values for Cooperative Multi-agent Actor-Critic methods

The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games

The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games

Coordinated Proximal Policy Optimization

MAPPO method based on attention behavior network

FP3O: Enabling Proximal Policy Optimization in Multi-Agent Cooperation with Parameter-Sharing Versatility

Optimistic Multi-Agent Policy Gradient

MACRPO: Multi-Agent Cooperative Recurrent Policy Optimization

Assigning Credit with Partial Reward Decoupling in Multi-Agent Proximal Policy Optimization

Decentralized Policy Optimization

JointPPO: Diving Deeper into the Effectiveness of PPO in Multi-Agent Reinforcement Learning

Off-Policy Multi-Agent Decomposed Policy Gradients

Multi-Agent Constrained Policy Optimisation

PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation

Communication-Efficient Cooperative Multi-Agent PPO via Regulated Segment Mixture in Internet of Vehicles

Multi-agent Policy Optimization with Approximatively Synchronous Advantage Estimation

CIM-PPO: Proximal Policy Optimization with Liu-Correntropy Induced Metric

Unlocking the Potential of MAPPO with Asynchronous Optimization

MAPPO-PIS: A Multi-Agent Proximal Policy Optimization Method with Prior Intent Sharing for CAVs' Cooperative Decision-Making

Policy Optimization with Model-based Explorations

Multi-Path Policy Optimization