Abstract: In cooperative multi-agent tasks, a team of agents jointly interacts with an environment by taking actions, receiving a team reward, and observing the next state. During these interactions, uncertainty in the environment and the reward inevitably induces stochasticity in the long-term returns, and this randomness is exacerbated as the number of agents grows. However, most existing value-based multi-agent reinforcement learning (MARL) methods model only the expectations of the individual Q-values and the global Q-value, ignoring such randomness. Rather than modeling only the expectations of the long-term returns, it is preferable to model the stochasticity directly by estimating the returns as distributions. With this motivation, this work proposes DQMIX, a novel value-based MARL method, from a distributional perspective. Specifically, we model each individual Q-value with a categorical distribution. To integrate these individual Q-value distributions into the global Q-value distribution, we design a distribution mixing network based on five basic operations on distributions. We further prove that DQMIX satisfies the Distributional-Individual-Global-Max (DIGM) principle with respect to the expectation of the distribution, which guarantees consistency between the joint greedy action selected from the global Q-value and the individual greedy actions selected from the individual Q-values. To validate DQMIX, we demonstrate its ability to factorize a matrix game with stochastic rewards. Furthermore, experimental results on a challenging set of StarCraft II micromanagement tasks show that DQMIX consistently outperforms the value-based MARL baselines.
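To make the DIGM consistency property concrete, the following is a minimal sketch, not the DQMIX mixing network itself. It assumes each per-action return distribution is a small categorical distribution stored as a dictionary from return atom to probability, and it stands in for the mixing step with a plain sum of independent returns. The abstract does not specify the five operations, so this additive mixer, the toy numbers, and all function names are illustrative assumptions.

```python
from itertools import product


def expectation(dist):
    """Expected return of a categorical distribution {atom: probability}."""
    return sum(z * p for z, p in dist.items())


def greedy_action(action_dists):
    """Index of the action whose return distribution has the largest expectation."""
    return max(range(len(action_dists)), key=lambda a: expectation(action_dists[a]))


def add_independent(d1, d2):
    """Distribution of the sum of two independent categorical returns.

    Assumption: this additive mixer is only a stand-in used for illustration.
    """
    out = {}
    for (z1, p1), (z2, p2) in product(d1.items(), d2.items()):
        out[z1 + z2] = out.get(z1 + z2, 0.0) + p1 * p2
    return out


# Toy data (hypothetical): agent -> action -> {return atom: probability}.
agent_dists = [
    [{0.0: 0.5, 2.0: 0.5}, {0.0: 0.2, 2.0: 0.8}],  # agent 0
    [{1.0: 1.0}, {0.0: 0.9, 4.0: 0.1}],            # agent 1
]

# Each agent picks its own greedy action from the expectation of its distribution.
individual_greedy = tuple(greedy_action(d) for d in agent_dists)

# The joint greedy action maximizes the expectation of the mixed (global) distribution.
joint_actions = list(product(range(2), repeat=2))
joint_greedy = max(
    joint_actions,
    key=lambda u: expectation(
        add_independent(agent_dists[0][u[0]], agent_dists[1][u[1]])
    ),
)

print(individual_greedy, joint_greedy)
```

Under this additive stand-in, the expectation of the mixed distribution is the sum of the individual expectations, so the joint greedy action coincides with the tuple of individual greedy actions. That coincidence is the kind of consistency the DIGM principle requires with respect to the expectation.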

Reducing overestimation in value mixing for cooperative deep multi-agent reinforcement learning

QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

Regularized Softmax Deep Multi-Agent Q-Learning

Reducing Q-Value Estimation Bias Via Mutual Estimation and Softmax Operation in MADRL

Better Value Estimation in Q-learning-based Multi-Agent Reinforcement Learning

Value function factorization with dynamic weighting for deep multi-agent reinforcement learning

Learning Multi-Agent Cooperation via Considering Actions of Teammates

SQIX: QMIX Algorithm Activated by General Softmax Operator for Cooperative Multiagent Reinforcement Learning

An Overestimation Reduction Method Based on the Multi-step Weighted Double Estimation Using Value-Decomposition Multi-agent Reinforcement Learning

Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning

Correcting Biased Value Estimation in Mixing Value-Based Multi-Agent Reinforcement Learning by Multiple Choice Learning

POWQMIX: Weighted Value Factorization with Potentially Optimal Joint Actions Recognition for Cooperative Multi-Agent Reinforcement Learning

Sample-efficient multi-agent reinforcement learning with masked reconstruction

DQMIX: A Distributional Perspective on Multi-Agent Reinforcement Learning

QPLEX: Duplex Dueling Multi-Agent Q-Learning

Weighted Double Deep Multiagent Reinforcement Learning in Stochastic Cooperative Environments

Value Functions Factorization with Latent State Information Sharing in Decentralized Multi-Agent Policy Gradients

Soft-QMIX: Integrating Maximum Entropy For Monotonic Value Function Factorization

Multi-agent Dueling Q-learning with Mean Field and Value Decomposition

Reducing Overestimation Bias in Multi-Agent Domains Using Double Centralized Critics

Boosting Value Decomposition Via Unit-Wise Attentive State Representation for Cooperative Multi-Agent Reinforcement Learning