Abstract: In cooperative multi-agent tasks, a team of agents jointly interacts with an environment by taking actions, receiving a team reward, and observing the next state. During these interactions, uncertainty in the environment and the reward inevitably induces stochasticity in the long-term returns, and this randomness can be exacerbated as the number of agents increases. However, most existing value-based multi-agent reinforcement learning (MARL) methods model only the expectations of the individual Q-values and the global Q-value, ignoring this randomness. Rather than modeling only the expectations of the long-term returns, it is preferable to model the stochasticity directly by estimating the returns as distributions. With this motivation, this work proposes DQMIX, a novel value-based MARL method, from a distributional perspective. Specifically, we model each individual Q-value with a categorical distribution. To integrate these individual Q-value distributions into the global Q-value distribution, we design a distribution mixing network based on five basic operations on distributions. We further prove that DQMIX satisfies the \emph{Distributional-Individual-Global-Max} (DIGM) principle with respect to the expectation of the distribution, which guarantees consistency between joint and individual greedy action selection in the global Q-value and the individual Q-values. To validate DQMIX, we demonstrate its ability to factorize a matrix game with stochastic rewards. Furthermore, experimental results on a challenging set of StarCraft II micromanagement tasks show that DQMIX consistently outperforms the value-based MARL baselines.
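To make the idea of a categorical Q-value distribution concrete, the following is a minimal sketch of a single agent's per-action return distributions and of greedy action selection with respect to their expectations. It is not the DQMIX implementation: the support `atoms`, the probability table `probs`, and the helper `expected_q` are hypothetical names and values chosen for illustration, and the distribution mixing network is not modeled here.

```python
import numpy as np


def expected_q(probs, atoms):
    """Expectation of categorical distributions.

    probs: array of shape (n_actions, n_atoms), each row sums to 1.
    atoms: array of shape (n_atoms,), the fixed return values (support).
    """
    return probs @ atoms


# Hypothetical support: 5 evenly spaced return values.
atoms = np.linspace(-2.0, 2.0, 5)

# Hypothetical categorical return distributions for 3 actions of one agent.
probs = np.array([
    [0.1, 0.2, 0.4, 0.2, 0.1],  # symmetric around zero
    [0.0, 0.1, 0.2, 0.3, 0.4],  # mass shifted toward high returns
    [0.4, 0.3, 0.2, 0.1, 0.0],  # mass shifted toward low returns
])

# Each action's scalar Q-value is the mean of its return distribution.
q_values = expected_q(probs, atoms)

# Greedy selection with respect to the expectation of the distribution.
greedy_action = int(np.argmax(q_values))
```

Under these assumed values, the second action has the largest expected return and is selected. The full distribution is still available for each action, which is the information an expectation-only method discards.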

Distributional Reinforcement Learning With Quantile Regression

Implicit Quantile Networks for Distributional Reinforcement Learning

Fully Parameterized Quantile Function for Distributional Reinforcement Learning

Distributional Reinforcement Learning with Dual Expectile-Quantile Regression

A Distributional Perspective on Reinforcement Learning

Distributional Reinforcement Learning for Efficient Exploration

Distributional Reinforcement Learning for Multi-Dimensional Reward Functions

Quantile Regression for Distributional Reward Models in RLHF

Value-Distributional Model-Based Reinforcement Learning

A Robust Quantile Huber Loss With Interpretable Parameter Adjustment In Distributional Reinforcement Learning

Off-policy Distributional Q($λ$): Distributional RL without Importance Sampling

An Analysis of Quantile Temporal-Difference Learning

The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation

Quantile Reinforcement Learning

Policy Evaluation in Distributional LQR (Extended Version)

Single-Trajectory Distributionally Robust Reinforcement Learning

DQMIX: A Distributional Perspective on Multi-Agent Reinforcement Learning

On Policy Evaluation Algorithms in Distributional Reinforcement Learning

Normality-Guided Distributional Reinforcement Learning for Continuous Control

How Does Value Distribution in Distributional Reinforcement Learning Help Optimization?