Policy Regularization via Noisy Advantage Values for Cooperative Multi-agent Actor-Critic methods

Jian Hu, Siyue Hu, Shih-wei Liao
2023-06-08
Abstract: Recent works have applied Proximal Policy Optimization (PPO) to multi-agent cooperative tasks, for example Independent PPO (IPPO) and vanilla Multi-agent PPO (MAPPO), which uses a centralized value function. However, previous literature shows that MAPPO may not perform as well as IPPO and the Fine-tuned QMIX on the StarCraft Multi-Agent Challenge (SMAC). MAPPO-Feature-Pruned (MAPPO-FP) improves the performance of MAPPO with carefully designed agent-specific features, which may limit how generally the algorithm can be used. By contrast, we find that MAPPO may face the problem of Policies Overfitting in Multi-agent Cooperation (POMAC), because the agents learn their policies from sampled advantage values. POMAC may lead to updating the multi-agent policies in a suboptimal direction and prevent the agents from exploring better trajectories. In this paper, to mitigate multi-agent policy overfitting, we propose a novel policy regularization method that perturbs the advantage values with random Gaussian noise. The experimental results show that our method outperforms the Fine-tuned QMIX and MAPPO-FP, and achieves SOTA on SMAC without agent-specific features. We open-source the code at https://github.com/hijkzzz/noisy-mappo.
Multiagent Systems
What problem does this paper attempt to address?
The paper addresses cooperative tasks in Multi-Agent Reinforcement Learning (MARL), in particular policy optimization under the Centralized Training with Decentralized Execution (CTDE) framework. The issues it addresses can be summarized as follows:

- **Problems with existing methods**: Vanilla Multi-Agent PPO (MAPPO) may not perform as well as Independent PPO (IPPO) and the Fine-tuned QMIX on the StarCraft Multi-Agent Challenge (SMAC). In particular, MAPPO may encounter the "Policies Overfitting in Multi-Agent Cooperation (POMAC)" problem, which leads to suboptimal policy update directions and limits the agents' exploration of better trajectories.
- **POMAC problem**: In multi-agent cooperative tasks, agents learn policies from sampled advantage values. This sampling may introduce bias, so the policy updates can overfit to the samples. When the number of agents is large, it is especially difficult to estimate each agent's contribution to the team, and hence the true gradient, from limited samples.
- **Objective of the paper**: The paper aims to propose a new policy regularization method that alleviates multi-agent policy overfitting and improves the performance of the algorithm.

To achieve this objective, the paper proposes two policy regularization methods: Noisy-Advantage MAPPO (NA-MAPPO) and Noisy-Value MAPPO (NV-MAPPO). Both methods use random Gaussian noise to perturb the advantage values, and thereby the policy update, in order to mitigate overfitting. Experimental results show that these methods significantly improve performance on benchmarks such as SMAC, and in some scenarios they achieve State-Of-The-Art (SOTA) performance. NV-MAPPO also performs well in non-monotonic matrix games, which suggests an advantage in expressiveness.
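The idea of perturbing advantage values with Gaussian noise can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). The mixing form `(1 - alpha) * A + alpha * noise`, the function names, and the parameters `alpha`, `sigma`, and `seed` are assumptions introduced here for illustration. The clipped term is the standard PPO surrogate, included only to show where a perturbed advantage would enter the policy update.

```python
import random


def noisy_advantages(advantages, n_agents, alpha=0.05, sigma=1.0, seed=None):
    """Return a per-agent copy of each shared advantage, mixed with Gaussian noise.

    advantages: list of shared (team) advantage estimates, one per sample.
    Result: list of rows, one row per sample, one entry per agent.
    The mixing form and the names alpha/sigma are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [
        # Each (sample, agent) pair draws its own noise, so agents no longer
        # see exactly the same sampled advantage value.
        [(1.0 - alpha) * a + alpha * rng.gauss(0.0, sigma) for _ in range(n_agents)]
        for a in advantages
    ]


def ppo_clip_term(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single (ratio, advantage) pair."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    shared = [1.0, -0.5, 2.0]  # hypothetical sampled advantages
    per_agent = noisy_advantages(shared, n_agents=4, alpha=0.05, seed=0)
    # With a probability ratio of 1.0 the surrogate equals the noisy advantage.
    objective = sum(ppo_clip_term(1.0, a) for row in per_agent for a in row)
    print(objective / (len(shared) * 4))
```

Setting `alpha=0.0` recovers the unperturbed shared advantages, which makes it easy to compare a regularized update against the baseline under the same sketch.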