Abstract: Existing value-based algorithms for cooperative multi-agent reinforcement learning (MARL) commonly rely on random exploration, such as $\epsilon$-greedy, to explore the environment. However, such exploration is inefficient at finding effective joint actions in states that require the cooperation of multiple agents. In this work, we propose ensemble value functions for multi-agent exploration (EMAX), a general framework to seamlessly extend value-based MARL algorithms with ensembles of value functions. EMAX leverages the ensemble of value functions to guide the exploration of agents, stabilise their optimisation, and make their policies more robust to miscoordination. These benefits are achieved by combining three techniques. (1) EMAX uses the uncertainty of value estimates across the ensemble in a UCB policy to guide exploration. This exploration policy focuses on parts of the environment that require cooperation across agents and thus enables agents to learn how to cooperate more efficiently. (2) During optimisation, EMAX computes target values as average value estimates across the ensemble. These targets exhibit lower variance than commonly applied target networks, which is a significant benefit in MARL, where high variance commonly arises from the exploration and non-stationary policies of other agents. (3) During evaluation, EMAX selects actions by majority vote across the ensemble, which reduces the likelihood of selecting sub-optimal actions. We instantiate three value-based MARL algorithms with EMAX (independent DQN, VDN, and QMIX) and evaluate them in 21 tasks across four environments. Using ensembles of five value functions, EMAX improves the sample efficiency and final evaluation returns of these algorithms by 60%, 47%, and 539%, respectively, averaged across the 21 tasks.
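
The three techniques named in the abstract can be illustrated for a single agent with a small NumPy sketch. This is not the authors' implementation: the function names, the exploration weight `beta`, the use of the ensemble standard deviation as the uncertainty measure, and the exact form of the averaged target are assumptions made for illustration, and the abstract does not specify them.

```python
import numpy as np

# q_ensemble has shape (K, n_actions): one row of action-value estimates
# per ensemble member, for a single agent in a single state.

def ucb_action(q_ensemble, beta=1.0):
    """Exploration: pick the action maximising mean + beta * std.

    Using the standard deviation across the ensemble as the uncertainty
    term is an assumption of this sketch.
    """
    mean = q_ensemble.mean(axis=0)
    std = q_ensemble.std(axis=0)
    return int(np.argmax(mean + beta * std))

def ensemble_target(reward, next_q_ensemble, gamma=0.99, done=False):
    """Optimisation: bootstrap from the ensemble-averaged next-state value.

    Each member contributes the value of its own greedy action; this is
    one plausible reading of "average value estimates across the ensemble".
    """
    if done:
        return reward
    next_values = next_q_ensemble.max(axis=1)  # shape (K,)
    return reward + gamma * next_values.mean()

def majority_vote_action(q_ensemble):
    """Evaluation: each member votes for its greedy action."""
    votes = q_ensemble.argmax(axis=1)
    counts = np.bincount(votes, minlength=q_ensemble.shape[1])
    return int(np.argmax(counts))  # ties resolve to the lowest action index

if __name__ == "__main__":
    # Three members, three actions; the members disagree most on action 1.
    q = np.array([[1.0, 0.0, 0.5],
                  [1.0, 2.0, 0.5],
                  [1.0, 0.1, 0.5]])
    print("UCB action:", ucb_action(q, beta=1.0))
    print("Majority-vote action:", majority_vote_action(q))
    print("Target:", ensemble_target(1.0, q, gamma=0.9))
```

In this toy input the UCB rule is drawn to the action on which the ensemble members disagree most, while the majority vote falls back on the action most members consider greedy, which mirrors the split between exploration and evaluation described in the abstract.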

Model-Assisted Reinforcement Learning with Adaptive Ensemble Value Expansion

Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion

Diminishing Return of Value Expansion Methods in Model-Based Reinforcement Learning

Is Model Ensemble Necessary? Model-based RL via a Single Model with Lipschitz Regularized Value Function

Model predictive control-based value estimation for efficient reinforcement learning

Model-based Reinforcement Learning with Multi-step Plan Value Estimation

Model-Ensemble Trust-Region Policy Optimization

Model-Based Reinforcement Learning via Meta-Policy Optimization

VMAV-C: A Deep Attention-based Reinforcement Learning Algorithm for Model-based Control

A stable data-augmented reinforcement learning method with ensemble exploration and exploitation

Value Gradient weighted Model-Based Reinforcement Learning

Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning

Model Embedding Model-Based Reinforcement Learning

Sample-Efficient Reinforcement Learning Via Conservative Model-Based Actor-Critic

Efficient Exploration in Continuous-time Model-based Reinforcement Learning

An Efficient Model-Based Approach on Learning Agile Motor Skills without Reinforcement

Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy

Trust the Model When It Is Confident: Masked Model-based Actor-Critic

Adaptive Rollout Length for Model-Based RL Using Model-Free Deep RL

A new deep reinforcement learning model for dynamic portfolio optimization

MBDP: A Model-based Approach to Achieve both Robustness and Sample Efficiency via Double Dropout Planning