Abstract: In multi-agent reinforcement learning (MARL), optimal control with robustness guarantees is critical for deployment in the real world. However, existing methods face challenges related to sample complexity, training instability, possible convergence to a suboptimal Nash equilibrium, and a lack of robustness to multiple perturbations. In this paper, we propose a unified framework for learning \emph{stochastic} policies to resolve these issues. We embed cooperative MARL problems into probabilistic graphical models, from which we derive the maximum entropy (MaxEnt) objective that is optimal for MARL. Based on the MaxEnt framework, we propose the \emph{Heterogeneous-Agent Soft Actor-Critic} (HASAC) algorithm. Theoretically, we prove that HASAC enjoys monotonic improvement and converges to a \emph{quantal response equilibrium} (QRE). Furthermore, HASAC is provably robust against a wide range of real-world uncertainties, including perturbations in rewards, environment dynamics, states, and actions. Finally, we generalize this approach into a unified template for MaxEnt algorithmic design, named \emph{Maximum Entropy Heterogeneous-Agent Mirror Learning} (MEHAML), which provides any induced method with the same guarantees as HASAC. We evaluate HASAC on seven benchmarks: Bi-DexHands, Multi-Agent MuJoCo, Pursuit-Evade, StarCraft Multi-Agent Challenge, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game. Results show that HASAC outperforms strong baselines in 34 out of 38 tasks, exhibiting improved training stability, better sample efficiency, and sufficient exploration. The robustness of HASAC is further validated under uncertainties in rewards, dynamics, states, and actions at 14 magnitudes, and in a real-world deployment in a multi-robot arena against these four types of uncertainty. See our page at \url{https://sites.google.com/view/meharl}.
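
For orientation, the MaxEnt objective mentioned in the abstract can be sketched in the form commonly used for soft actor-critic methods, extended with one entropy term per agent. The abstract itself does not state the formula, so every symbol below is an assumption made for illustration: $n$ agents with joint policy $\boldsymbol{\pi} = (\pi^1, \dots, \pi^n)$, shared reward $r$, discount factor $\gamma$, initial state distribution $\rho_0$, transition kernel $P$, and temperature $\alpha > 0$.

```latex
% Illustrative sketch only: a standard entropy-regularized cooperative objective.
% All notation is assumed, not taken from the abstract.
\begin{equation*}
J(\boldsymbol{\pi})
  = \mathbb{E}_{s_0 \sim \rho_0,\; \mathbf{a}_t \sim \boldsymbol{\pi}(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, \mathbf{a}_t)}
    \left[ \sum_{t=0}^{\infty} \gamma^{t}
      \left( r(s_t, \mathbf{a}_t)
        + \alpha \sum_{i=1}^{n} \mathcal{H}\!\left( \pi^{i}(\cdot \mid s_t) \right)
      \right)
    \right],
\qquad
\mathcal{H}\!\left( \pi^{i}(\cdot \mid s) \right)
  = - \mathbb{E}_{a^{i} \sim \pi^{i}(\cdot \mid s)}
      \left[ \log \pi^{i}(a^{i} \mid s) \right].
\end{equation*}
```

Under this assumed form, setting $\alpha = 0$ recovers the usual expected discounted return, while larger $\alpha$ rewards more stochastic per-agent policies, which is the sense in which the framework learns \emph{stochastic} policies.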

Rethinking the Implementation Tricks and Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning

Qauxi: Cooperative Multi-Agent Reinforcement Learning with Knowledge Transferred from Auxiliary Task

SQIX: QMIX Algorithm Activated by General Softmax Operator for Cooperative Multiagent Reinforcement Learning

Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning

PPS-QMIX: Periodically Parameter Sharing for Accelerating Convergence of Multi-Agent Reinforcement Learning

QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

Qatten: A General Framework for Cooperative Multiagent Reinforcement Learning

Rules-PPO-QMIX: Multi-Agent Reinforcement Learning with Mixed Rules for Large Scene Tasks

Learning Multi-Agent Cooperation via Considering Actions of Teammates

Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning

MA2QL: A Minimalist Approach to Fully Decentralized Multi-Agent Reinforcement Learning

Sample-Efficient Multi-Agent RL: An Optimization Perspective

Regularized Softmax Deep Multi-Agent Q-Learning

Adaptive Individual Q-Learning-A Multiagent Reinforcement Learning Method for Coordination Optimization

Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning

Multi-Agent Constrained Policy Optimisation

Multiagent Q-learning with Sub-Team Coordination

Optimistic Sequential Multi-Agent Reinforcement Learning with Motivational Communication

Robust Multi-Agent Control via Maximum Entropy Heterogeneous-Agent Reinforcement Learning

ISFORS-MIX: Multi-Agent Reinforcement Learning with Importance-Sampling-Free Off-policy learning and Regularized-Softmax Mixing Network

Priority over Quantity: A Self-Incentive Credit Assignment Scheme for Cooperative Multiagent Reinforcement Learning