Abstract: Multi-agent reinforcement learning (MARL) has exploded in popularity in recent years. While numerous approaches have been developed, they can be broadly categorized into three main types: centralized training and execution (CTE), centralized training for decentralized execution (CTDE), and decentralized training and execution (DTE). CTE methods assume centralization during training and execution (e.g., with fast, free, and perfect communication) and have the most information during execution. CTDE methods are the most common, as they leverage centralized information during training while enabling decentralized execution -- using only information available to that agent during execution. Decentralized training and execution methods make the fewest assumptions and are often simple to implement.
This text is an introduction to cooperative MARL -- MARL in which all agents share a single, joint reward. It is meant to explain the setting, basic concepts, and common methods for the CTE, CTDE, and DTE settings. It does not cover all work in cooperative MARL as the area is quite extensive. I have included work that I believe is important for understanding the main concepts in the area and apologize to those that I have omitted. Topics include simple applications of single-agent methods to CTE as well as some more scalable methods that exploit the multi-agent structure, independent Q-learning and policy gradient methods and their extensions, as well as value function factorization methods including the well-known VDN, QMIX, and QPLEX approaches, and centralized critic methods including MADDPG, COMA, and MAPPO. I also discuss common misconceptions, the relationship between different approaches, and some open questions.
What problem does this paper attempt to address?
This paper addresses cooperation in multi-agent reinforcement learning (MARL): how agents can learn and make decisions effectively when they all share a single joint reward. It focuses on how multiple agents can cooperate to maximize the common long-term return in a partially observable environment.
### Cooperative Multi-agent Reinforcement Learning (Cooperative MARL) Problem
The cooperative multi-agent reinforcement learning problem discussed in the paper can be formalized as a **Decentralized Partially Observable Markov Decision Process (Dec-POMDP)**. A Dec-POMDP extends the POMDP to multi-agent, decentralized settings. All agents share the same reward function, which is what makes the problem cooperative.
#### Definition of Dec-POMDP
A Dec-POMDP is defined by the following tuple:
\[ \langle I, S, \{A_i\}, T, R, \{O_i\}, O, H, \gamma \rangle \]
- \( I \): a finite set of agents, with size \( |I| = n \)
- \( S \): a finite set of states, with initial state distribution \( b_0 \)
- \( A_i \): the action set of each agent \( i \), and the joint action set \( A=\times_i A_i \)
- \( T \): the state transition probability function \( T: S\times A\times S\rightarrow[0, 1] \), giving the probability \( T(s, a, s') = \Pr(s' \mid s, a) \) of transitioning from state \( s \) to state \( s' \) when the agents take joint action \( a \)
- \( R \): the reward function \( R: S\times A\rightarrow\mathbb{R} \), giving the immediate reward for taking joint action \( a \) in state \( s \)
- \( O_i \): the observation set of each agent \( i \), with the joint observation set written \( \mathbb{O} = \times_i O_i \) to keep it distinct from the observation function \( O \)
- \( O \): the observation probability function \( O: \mathbb{O}\times A\times S\rightarrow[0, 1] \), giving the probability \( O(o, a, s') = \Pr(o \mid a, s') \) of the agents seeing joint observation \( o \) after joint action \( a \) leads to state \( s' \)
- \( H \): the number of time steps before termination, called the horizon
- \( \gamma\in[0, 1] \): the discount factor
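
The tuple above maps fairly directly onto a small data structure. The following is a minimal sketch, not code from the paper: the class name `DecPOMDP`, its field names, the `step` and `discounted_return` helpers, and the two-agent toy problem are all assumptions made for illustration. In the toy problem the state flips only when both agents choose `"switch"`, the shared reward is 1 in state `"right"`, and only the first agent observes the state.

```python
import itertools
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

JointAction = Tuple[str, ...]
JointObs = Tuple[str, ...]


@dataclass
class DecPOMDP:
    """Illustrative container for the Dec-POMDP tuple (names are hypothetical)."""
    agents: List[str]                                  # I
    states: List[str]                                  # S
    b0: Dict[str, float]                               # initial state distribution
    actions: Dict[str, List[str]]                      # A_i for each agent
    T: Callable[[str, JointAction, str], float]        # T(s, a, s')
    R: Callable[[str, JointAction], float]             # R(s, a), shared by all agents
    observations: Dict[str, List[str]]                 # O_i for each agent
    O: Callable[[JointObs, JointAction, str], float]   # O(o, a, s')
    horizon: int                                       # H
    gamma: float                                       # discount factor

    def joint_actions(self) -> List[JointAction]:
        # Cartesian product of the per-agent action sets.
        return list(itertools.product(*(self.actions[i] for i in self.agents)))

    def joint_observations(self) -> List[JointObs]:
        # Cartesian product of the per-agent observation sets.
        return list(itertools.product(*(self.observations[i] for i in self.agents)))

    def step(self, s: str, a: JointAction, rng: random.Random):
        # One environment step: shared reward, next state, joint observation.
        r = self.R(s, a)
        s_next = rng.choices(self.states, weights=[self.T(s, a, s2) for s2 in self.states])[0]
        candidates = self.joint_observations()
        o = rng.choices(candidates, weights=[self.O(c, a, s_next) for c in candidates])[0]
        return s_next, o, r


def discounted_return(rewards: List[float], gamma: float) -> float:
    # Sum of gamma^t * r_t over one episode of shared rewards.
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def toy_T(s: str, a: JointAction, s2: str) -> float:
    flipped = "right" if s == "left" else "left"
    target = flipped if all(x == "switch" for x in a) else s
    return 1.0 if s2 == target else 0.0


def toy_R(s: str, a: JointAction) -> float:
    return 1.0 if s == "right" else 0.0


def toy_O(o: JointObs, a: JointAction, s2: str) -> float:
    # agent0 sees the new state; agent1 always sees "none".
    return 1.0 if o == (s2, "none") else 0.0


toy = DecPOMDP(
    agents=["agent0", "agent1"],
    states=["left", "right"],
    b0={"left": 1.0, "right": 0.0},
    actions={"agent0": ["stay", "switch"], "agent1": ["stay", "switch"]},
    T=toy_T,
    R=toy_R,
    observations={"agent0": ["left", "right"], "agent1": ["none"]},
    O=toy_O,
    horizon=3,
    gamma=0.9,
)
```

Calling `toy.step("left", ("switch", "switch"), random.Random(0))` moves the toy problem to state `"right"`, and the agents' objective is to choose actions that maximize the expected value of `discounted_return` over the shared rewards.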
### Overview of Solutions
The paper mainly discusses three training and execution modes:
1. **Centralized Training and Execution (CTE)**: It is assumed that there is centralized information (such as fast, free, and perfect communication) during both training and execution, allowing each agent's action to depend on the information of all agents.
2. **Centralized Training for Decentralized Execution (CTDE)**: Centralized information is used during training, but each agent acts using only the information available to it during execution. This is the most common approach.
3. **Decentralized Training and Execution (DTE)**: This setting makes the fewest assumptions. All agents learn independently, which suits situations where no centralized training phase is available.
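
One way to read the three settings is by what a policy may condition on at execution time. The sketch below is an illustration under assumptions, not an API from the paper: the function names, the history format, and the toy policies are hypothetical. It shows that under CTE a single policy can see every agent's history, while under CTDE and DTE each agent's policy sees only its own history. CTDE and DTE execute the same way and differ only in what is available during training.

```python
from typing import Callable, Dict, Tuple

History = Tuple[str, ...]  # one agent's local history (hypothetical format)


def centralized_execution(
    joint_policy: Callable[[Dict[str, History]], Dict[str, str]],
    histories: Dict[str, History],
) -> Dict[str, str]:
    # CTE: one policy sees all agents' histories and returns the joint action.
    return joint_policy(histories)


def decentralized_execution(
    local_policies: Dict[str, Callable[[History], str]],
    histories: Dict[str, History],
) -> Dict[str, str]:
    # CTDE and DTE: each agent's policy is given only that agent's own history.
    return {i: local_policies[i](histories[i]) for i in histories}


def toy_joint_policy(histories: Dict[str, History]) -> Dict[str, str]:
    # If any agent has seen the door, send everyone left.
    seen = any("door_left" in h for h in histories.values())
    return {i: ("left" if seen else "wait") for i in histories}


def toy_local_policy(history: History) -> str:
    # An agent can only react to what it has seen itself.
    return "left" if "door_left" in history else "wait"


histories = {"agent0": ("door_left",), "agent1": ("nothing",)}
cte_actions = centralized_execution(toy_joint_policy, histories)
dec_actions = decentralized_execution(
    {"agent0": toy_local_policy, "agent1": toy_local_policy}, histories
)
```

Here `cte_actions` sends both agents left because the centralized policy can use `agent0`'s observation on behalf of `agent1`, whereas in `dec_actions` the second agent waits because it never saw the door.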
### Main Challenges
1. **Partial Observability**: Since each agent can only obtain local observations, historical information needs to be considered to make better decisions.
2. **Large Joint Spaces**: As the number of agents increases, the joint action and observation spaces grow exponentially, so methods that reason over joint quantities quickly become computationally expensive.
3. **Coordination and Cooperation**: Agents need to cooperate effectively to maximize the common long-term return.
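
The growth behind the first two challenges can be made concrete with a little arithmetic. This is a rough sketch with hypothetical helper names and made-up sizes (5 actions and 3 observations per agent): the joint action set is the product of the per-agent sets, and the number of distinct action-observation histories one agent may need to distinguish grows exponentially with the number of steps.

```python
import math
from typing import List


def joint_action_count(actions_per_agent: List[int]) -> int:
    # |A| is the product of the |A_i| because A is the Cartesian product of the A_i.
    return math.prod(actions_per_agent)


def local_history_count(num_actions: int, num_observations: int, steps: int) -> int:
    # Each step appends one (action, observation) pair to an agent's history.
    return (num_actions * num_observations) ** steps


# Joint action counts for 2, 5, and 10 agents with 5 actions each.
sizes = [joint_action_count([5] * n) for n in (2, 5, 10)]

# Histories one agent could have after 4 steps with 5 actions and 3 observations.
histories_after_4 = local_history_count(5, 3, 4)
```

With these numbers `sizes` is `[25, 3125, 9765625]` and `histories_after_4` is `50625`, which is why tabular treatment of joint actions or full histories stops being practical for even modest problems.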
Through a detailed discussion of these methods, the paper aims to provide researchers with a basic framework for understanding and applying cooperative multi-agent reinforcement learning.