Abstract: Multi-agent reinforcement learning (MARL) has exploded in popularity in recent years. Many approaches have been developed, but they can be divided into three main types: centralized training and execution (CTE), centralized training for decentralized execution (CTDE), and decentralized training and execution (DTE).
CTDE methods are the most common as they can use centralized information during training but execute in a decentralized manner -- using only information available to that agent during execution. CTDE is the only paradigm that requires a separate training phase where any available information (e.g., other agent policies, underlying states) can be used. As a result, they can be more scalable than CTE methods, do not require communication during execution, and can often perform well. CTDE fits most naturally with the cooperative case, but can be potentially applied in competitive or mixed settings depending on what information is assumed to be observed.
This text is an introduction to CTDE in cooperative MARL. It is meant to explain the setting, basic concepts, and common methods. It does not cover all work in CTDE MARL as the subarea is quite extensive. I have included work that I believe is important for understanding the main concepts in the subarea and apologize to those that I have omitted.
What problem does this paper attempt to address?
The paper addresses how centralized training for decentralized execution (CTDE) can be used in cooperative multi-agent reinforcement learning (MARL) so that multiple agents cooperate effectively under uncertainty. Specifically, it explores how information from all agents can be used during a centralized training phase, while during execution each agent acts independently, relying only on its own local observations.
### Background and Problem Definition
The paper first introduces the formal model of the cooperative multi-agent reinforcement learning problem: the Decentralized Partially Observable Markov Decision Process (Dec-POMDP). A Dec-POMDP is a multi-agent extension of the POMDP in which the team receives a single joint reward, but each agent must choose its actions based only on its own local observations.
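For reference, one common way to write the Dec-POMDP is given below. The notation is an assumption chosen to match the symbols used later in this summary (agent set \(I\), local history \(h_i\), action \(a_i\)); the paper itself may use a finite horizon instead of a discount factor, or different symbols.

\[
\mathcal{M} = \langle I,\; S,\; \{A_i\}_{i \in I},\; T,\; R,\; \{\Omega_i\}_{i \in I},\; O,\; \gamma \rangle
\]

Here \(S\) is the set of underlying states, \(A_i\) and \(\Omega_i\) are the action and observation sets of agent \(i\), \(T(s' \mid s, a)\) is the transition function over joint actions \(a = (a_1, \ldots, a_n)\), \(R(s, a)\) is the shared team reward, \(O(o \mid s', a)\) is the joint observation function, and \(\gamma \in [0, 1)\) is a discount factor. Each agent acts through a policy \(\pi_i(a_i \mid h_i)\) over its own action-observation history, and the team objective is

\[
\max_{\pi_1, \ldots, \pi_n} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \right].
\]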
### CTDE Overview
The core idea of CTDE is to allow some degree of centralization during training, for example by using the policies of other agents or the underlying state, while keeping execution fully decentralized: each agent selects actions independently, based only on its own action-observation history. The approach aims to combine the advantage of centralized training (making full use of all available information) with the advantages of decentralized execution (no need for communication and better scalability).
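The split between training-time and execution-time information can be illustrated with a minimal sketch. Everything here is hypothetical and chosen only for illustration: the linear actors and critic, the dimensions, and the names `act` and `centralized_value` are not from the paper, and a single observation stands in for each agent's full history.

```python
import random

random.seed(0)
N_AGENTS, OBS_DIM, STATE_DIM, N_ACTIONS = 2, 3, 4, 5

def rand_vec(n):
    """Random weights standing in for learned parameters."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Hypothetical linear actors: actor_weights[i][a] scores action a for agent i.
actor_weights = [[rand_vec(OBS_DIM) for _ in range(N_ACTIONS)]
                 for _ in range(N_AGENTS)]
# Hypothetical linear critic over the state and the one-hot joint action.
critic_weights = rand_vec(STATE_DIM + N_AGENTS * N_ACTIONS)

def act(agent_id, local_obs):
    """Decentralized execution: uses only this agent's own observation."""
    scores = [dot(w, local_obs) for w in actor_weights[agent_id]]
    return max(range(N_ACTIONS), key=scores.__getitem__)

def centralized_value(state, joint_action):
    """Training-time critic: sees the underlying state and all agents' actions."""
    one_hot = [1.0 if a == chosen else 0.0
               for chosen in joint_action for a in range(N_ACTIONS)]
    return dot(critic_weights, state + one_hot)

state = rand_vec(STATE_DIM)
observations = [rand_vec(OBS_DIM) for _ in range(N_AGENTS)]
joint_action = [act(i, observations[i]) for i in range(N_AGENTS)]
value = centralized_value(state, joint_action)
```

Only `act` would be needed once training is over; `centralized_value` exists solely to provide a learning signal and can be discarded at execution time.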
### Main Methods
1. **Value Function Decomposition Methods**:
- **VDN (Value Decomposition Network)**: Achieves centralized training and decentralized execution by decomposing the joint Q-function into a sum of per-agent Q-functions. Specifically, it assumes that \( Q(h, a)\approx\sum_{i\in I}Q_{i}(h_{i}, a_{i})\), where \(h\) and \(a\) are the joint history and joint action, and \( Q_{i}(h_{i}, a_{i})\) is the Q-function of the \(i\)-th agent.
- **QMIX**: Extends the idea of VDN by allowing a more general monotonic function to combine the per-agent Q-functions. Specifically, it assumes that \( Q(h, a)\approx f_{\text{mono}}(Q_{1}(h_{1}, a_{1}),\ldots, Q_{n}(h_{n}, a_{n}))\), where \( f_{\text{mono}}\) is monotonically non-decreasing in each \(Q_{i}\).
2. **Centralized Critic Methods**:
- **MADDPG (Multi-Agent Deep Deterministic Policy Gradient)**: Uses a centralized critic to estimate the joint value function and uses that critic to update the decentralized actor (policy) of each agent.
- **COMA (Counterfactual Multi-Agent Policy Gradients)**: Introduces a counterfactual baseline to address the credit assignment problem and improve the performance of multi-agent policy gradient methods.
- **MAPPO (Multi-Agent Proximal Policy Optimization)**: Applies Proximal Policy Optimization (PPO) to the multi-agent setting and uses a centralized critic to improve the learning of decentralized policies.
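The decompositions and the counterfactual baseline listed above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the published architectures: `monotonic_mix` is a fixed two-layer mixer with made-up weights (QMIX generates its mixing weights from the state), the Q-values are hypothetical numbers, and `coma_advantage` assumes the common form in which the baseline is the policy-weighted average of the centralized Q-value over one agent's alternative actions.

```python
import math
from itertools import product

def vdn_joint_q(agent_qs):
    """VDN-style mixing: the joint value is the sum of per-agent values."""
    return sum(agent_qs)

def elu(x):
    """Strictly increasing activation, so it preserves monotonicity."""
    return x if x > 0 else math.exp(x) - 1.0

def monotonic_mix(agent_qs, w1, w2):
    """Two-layer mixer; abs() keeps weights non-negative, hence monotonic in each Q_i."""
    hidden = [elu(sum(abs(w) * q for w, q in zip(row, agent_qs))) for row in w1]
    return sum(abs(w) * h for w, h in zip(w2, hidden))

def coma_advantage(counterfactual_qs, policy_probs, taken_action):
    """Q of the taken action minus a counterfactual baseline.

    counterfactual_qs[a] is the centralized Q-value when only this agent's
    action is replaced by a and the other agents' actions stay fixed.
    """
    baseline = sum(p * q for p, q in zip(policy_probs, counterfactual_qs))
    return counterfactual_qs[taken_action] - baseline

# Hypothetical per-agent Q-values Q_i(h_i, a_i) for one fixed joint history.
q_tables = [[1.0, 3.0, 2.0], [0.5, -1.0, 4.0]]
w1, w2 = [[0.5, -1.2], [2.0, 0.3]], [1.0, -0.7]  # made-up mixer weights

def joint_argmax(mixer):
    """Brute-force search over all joint actions (feasible only for tiny problems)."""
    joint_actions = product(*(range(len(q)) for q in q_tables))
    return max(joint_actions,
               key=lambda a: mixer([q[ai] for q, ai in zip(q_tables, a)]))

# Decentralized greedy choice: each agent maximizes its own Q_i.
greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in q_tables)
vdn_best = joint_argmax(vdn_joint_q)
mono_best = joint_argmax(lambda qs: monotonic_mix(qs, w1, w2))
advantage = coma_advantage([1.0, 2.0, 4.0], [0.5, 0.25, 0.25], taken_action=2)
```

In this example both mixers select the same joint action as the per-agent greedy choices. That is the property that makes decentralized execution possible: each agent can maximize its own \(Q_{i}\) without searching the joint action space.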
### Other Forms of CTDE
The paper also discusses other forms of CTDE, such as adding centralized information (for example, parameter sharing) to otherwise decentralized methods, and decentralizing centralized solutions.
### Summary
The main purpose of the paper is to introduce the basic concepts and common techniques of CTDE, especially value function decomposition methods and centralized critic methods. These methods target a key challenge in cooperative multi-agent reinforcement learning: how to make use of centralized information during training while maintaining decentralized execution.