Abstract: Existing multi-agent PPO algorithms lack compatibility with different types of parameter sharing when extending the theoretical guarantee of PPO to cooperative multi-agent reinforcement learning (MARL). In this paper, we propose a novel and versatile multi-agent PPO algorithm for cooperative MARL to overcome this limitation. Our approach is achieved upon the proposed full-pipeline paradigm, which establishes multiple parallel optimization pipelines by employing various equivalent decompositions of the advantage function. This procedure successfully formulates the interconnections among agents in a more general manner, i.e., the interconnections among pipelines, making it compatible with diverse types of parameter sharing. We provide a solid theoretical foundation for policy improvement and subsequently develop a practical algorithm called Full-Pipeline PPO (FP3O) by several approximations. Empirical evaluations on Multi-Agent MuJoCo and StarCraft II tasks demonstrate that FP3O outperforms other strong baselines and exhibits remarkable versatility across various parameter-sharing configurations.
What problem does this paper attempt to address?
The main problem this paper addresses is that existing Proximal Policy Optimization (PPO) algorithms for cooperative multi-agent reinforcement learning (MARL) lack compatibility with different types of parameter sharing. Specifically:
1. **Limitations of existing PPO algorithms in multi-agent environments**:
- When single-agent PPO is extended to the cooperative multi-agent setting, existing algorithms do not adapt well to the different parameter-sharing schemes (full parameter sharing, partial parameter sharing, and no parameter sharing). This leads to unstable performance across tasks and network types.
- Although existing multi-agent PPO algorithms (such as IPPO, MAPPO, and CoPPO) perform well under certain specific conditions, their optimization processes rely on strict training procedures, which limits their generality and flexibility.
2. **The importance of parameter sharing**:
- Parameter sharing is an important topic in multi-agent reinforcement learning because it strongly affects training efficiency, knowledge transfer, and overall algorithm performance.
- Different task scenarios call for different parameter-sharing schemes: full parameter sharing suits scenarios with homogeneous agents or a large number of agents; no parameter sharing gives each agent an independent policy network, which suits heterogeneous agents; partial parameter sharing balances diversity and sharing in large-scale heterogeneous-agent scenarios.
3. **Lack of theoretical guarantees**:
- When existing PPO is extended to the multi-agent setting, it lacks a theoretical guarantee of monotonic improvement, especially across different types of parameter sharing. This makes it difficult to predict how the algorithm will perform under different tasks and network configurations.
To solve these problems, the paper proposes a new multi-agent PPO algorithm, FP3O (Full-Pipeline PPO). It introduces the full-pipeline paradigm to overcome the limitations of existing algorithms, achieving broad compatibility with different types of parameter sharing while providing a solid theoretical basis for monotonic policy improvement.
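The three sharing schemes described above can be made concrete with a small sketch. The tiny two-layer policy, the helper names `build_policies` and `action_probs`, and the reading of partial sharing as "shared hidden layer, separate output head per agent" are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def build_policies(n_agents, sharing, obs_dim=4, hidden=8, n_actions=3, seed=0):
    """Return one parameter dict per agent for a tiny two-layer policy.

    Hypothetical helper: `sharing` is "full", "partial", or "none".
    """
    rng = np.random.default_rng(seed)

    def new_trunk():
        return rng.normal(scale=0.1, size=(obs_dim, hidden))

    def new_head():
        return rng.normal(scale=0.1, size=(hidden, n_actions))

    if sharing == "full":
        # Every agent points at the same trunk and the same head.
        trunk, head = new_trunk(), new_head()
        return [{"trunk": trunk, "head": head} for _ in range(n_agents)]
    if sharing == "partial":
        # Assumed reading: one shared trunk, a separate head per agent.
        trunk = new_trunk()
        return [{"trunk": trunk, "head": new_head()} for _ in range(n_agents)]
    if sharing == "none":
        # Each agent owns all of its parameters.
        return [{"trunk": new_trunk(), "head": new_head()} for _ in range(n_agents)]
    raise ValueError(f"unknown sharing mode: {sharing!r}")


def action_probs(params, obs):
    """Softmax policy over discrete actions for a single observation."""
    logits = np.tanh(obs @ params["trunk"]) @ params["head"]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```

Because shared arrays are the same object in this sketch, an in-place update made through one agent changes every agent that shares them. That coupling between agents is what a multi-agent PPO update has to account for when parameters are shared.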
### Formula Summary
- **Advantage Function Decomposition**:
\[
A_\pi(s, a)=\sum_{m=1}^{n} A^{i_m}_\pi\left(s, a_{i_{1:m-1}}, a_{i_m}\right)
\]
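The decomposition can be checked numerically for a single state. The sketch below assumes the usual definition of the per-agent term as a difference of marginal Q-values, where agents outside the prefix act according to the current policy; the summary does not spell this definition out, so treat it as an assumption. The Q-table, policies, and function names are made up for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4

# Hypothetical joint Q-values for one fixed state, one axis per agent.
Q = rng.normal(size=(n_actions,) * n_agents)
# Independent per-agent policies at that state (each row sums to 1).
policies = rng.dirichlet(np.ones(n_actions), size=n_agents)


def marginal_q(order, m, joint_action):
    """Expected Q with the first m agents in `order` fixed, the rest drawn from pi."""
    fixed = set(order[:m])
    q = Q
    # Remove axes from last to first so the remaining axis indices stay valid.
    for agent in reversed(range(n_agents)):
        if agent in fixed:
            q = np.take(q, joint_action[agent], axis=agent)
        else:
            q = np.tensordot(q, policies[agent], axes=([agent], [0]))
    return float(q)


def advantage_term(order, m, joint_action):
    """Advantage of the m-th agent in `order`, given the actions of the agents before it."""
    return marginal_q(order, m, joint_action) - marginal_q(order, m - 1, joint_action)


def decomposed_advantage(order, joint_action):
    """Sum of the per-agent terms along one ordering."""
    return sum(advantage_term(order, m, joint_action) for m in range(1, n_agents + 1))


joint_action = (2, 0, 3)
state_value = marginal_q((0, 1, 2), 0, joint_action)  # no agent fixed: V(s)
joint_advantage = Q[joint_action] - state_value
totals = [
    decomposed_advantage(order, joint_action)
    for order in itertools.permutations(range(n_agents))
]
```

Under these assumptions every entry of `totals` equals `joint_advantage` up to floating-point error, because the sum telescopes. Each agent ordering therefore gives an equivalent decomposition, which is the property the full-pipeline paradigm relies on.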
- **Lower Bound of Policy Improvement**:
\[
J(\tilde{\pi})\geq J(\pi)+\sum_{m=1}^{n} M^{i_m}_\pi\left(\tilde{\pi}_{i_{1:m-1}}, \tilde{\pi}_{i_m}\right)
\]
where,
\[
M^{i_{1:m}}_\pi\left(\tilde{\pi}_{j_{1:k}}, \tilde{\pi}_{i_{1:m}}\right)=\mathbb{E}_{s\sim\rho_\pi,\, a\sim\tilde{\pi}}\left[A^{i_{1:m}}_\pi\left(s, a_{j_{1:k}}, a_{i_{1:m}}\right)\right]-\sum_{i\in i_{1:m}} C\, D_{\text{KL}}(\pi_i\|\tilde{\pi}_i)
\]
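As a sanity check on the penalized term, the sketch below evaluates it in a one-state tabular setting with all agents in the optimized set and no conditioning agents. Collapsing the state distribution to a single state, the value chosen for the constant `C`, and the helper names are simplifying assumptions made here for illustration only.

```python
import numpy as np


def kl(p, q):
    """KL divergence D_KL(p || q) between two strictly positive distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def joint_prob(policies):
    """Joint action distribution of independent agents, one axis per agent."""
    joint = policies[0]
    for p in policies[1:]:
        joint = np.multiply.outer(joint, p)
    return joint


def surrogate_m(Q, old, new, C):
    """Expected advantage of the old policy under the new one, minus the KL penalty."""
    advantage = Q - np.sum(joint_prob(old) * Q)  # A_pi(s, a) = Q(s, a) - V(s)
    expected_adv = float(np.sum(joint_prob(new) * advantage))
    penalty = C * sum(kl(p, q) for p, q in zip(old, new))
    return expected_adv - penalty


rng = np.random.default_rng(1)
Q = rng.normal(size=(3, 3))                   # hypothetical Q-table, two agents
old = rng.dirichlet(np.ones(3), size=2)       # current policies pi_i
new = rng.dirichlet(np.ones(3), size=2)       # candidate policies
C = 0.5                                       # arbitrary illustrative constant

m_same = surrogate_m(Q, old, old, C)
m_new = surrogate_m(Q, old, new, C)
m_new_unpenalized = surrogate_m(Q, old, new, 0.0)
```

In this setting `m_same` is zero up to rounding, since the advantage has zero mean under the old policy and the KL terms vanish, so the bound reduces to the trivial statement that the policy is no worse than itself. The penalty can only lower the value, so `m_new` never exceeds `m_new_unpenalized`.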
- **Single-Pipeline Optimization**:
\[
J(\tilde{\pi})\geq J(\pi)+M^{i_p}_\pi(\tilde{\pi}_\emptyset, \tilde{\pi}_{i_p})+M^{-i_p}_\pi(\tilde{\pi}_{i_p}, \tilde{\pi}_{-i_p})
\]
- **Full-Pipeline Paradigm**:
\[
J(\pi)+M^{i_1}_\pi(\tilde{\pi}_\emptyset, \tilde{\pi}_{i_1})+M^{-i_1}_\pi(\tilde{\pi}_{i_1}, \tilde{\pi}_{-i_1})\\
\vdots\\
J(\pi)+M^{i_n}_\pi(\t