Macro-Action-Based Deep Multi-Agent Reinforcement Learning

Yuchen Xiao, Joshua Hoffman, Christopher Amato
DOI: https://doi.org/10.48550/arXiv.2004.08646
2021-10-17
Abstract: In real-world multi-robot systems, performing high-quality, collaborative behaviors requires robots to asynchronously reason about high-level action selection at varying time durations. Macro-Action Decentralized Partially Observable Markov Decision Processes (MacDec-POMDPs) provide a general framework for asynchronous decision making under uncertainty in fully cooperative multi-agent tasks. However, multi-agent deep reinforcement learning methods have only been developed for (synchronous) primitive-action problems. This paper proposes two Deep Q-Network (DQN) based methods for learning decentralized and centralized macro-action-value functions with novel macro-action trajectory replay buffers introduced for each case. Evaluations on benchmark problems and a larger domain demonstrate the advantage of learning with macro-actions over primitive-actions and the scalability of our approaches.
Machine Learning, Artificial Intelligence, Robotics
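The abstract's central idea is learning value functions over macro-actions that run for varying durations. The snippet below is a minimal sketch of what changes relative to a primitive-action DQN target, assuming the standard semi-MDP form y = r_c + γ^τ max_{m'} Q(h', m'), where r_c is the discounted reward collected during the τ primitive steps the macro-action ran. The paper's exact target (for example its Double-DQN details) may differ, and the names `cumulative_reward` and `macro_action_target` are hypothetical.

```python
# Minimal sketch (assumption): semi-MDP style target for a macro-action-value
# function. Function names are hypothetical, not taken from the paper.
from typing import Sequence


def cumulative_reward(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of the primitive-step rewards received while one macro-action ran."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def macro_action_target(rewards: Sequence[float], next_q_values: Sequence[float],
                        gamma: float, episode_done: bool = False) -> float:
    """Target y = r_c + gamma**tau * max_m' Q(h', m'), with tau = len(rewards)."""
    tau = len(rewards)  # number of primitive steps the macro-action lasted
    r_c = cumulative_reward(rewards, gamma)
    if episode_done:  # no bootstrapping past the end of an episode
        return r_c
    return r_c + (gamma ** tau) * max(next_q_values)


if __name__ == "__main__":
    # A macro-action (e.g. "navigate to a target point") that lasted 3 primitive steps.
    y = macro_action_target([0.0, 0.0, 1.0], next_q_values=[2.0, 5.0], gamma=0.9)
    print(round(y, 4))  # 0.81 + 0.729 * 5.0 = 4.455
```

With a single-step macro-action (τ = 1) this reduces to the familiar one-step DQN target.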
What problem does this paper attempt to address?
This paper addresses how multi-robot systems can achieve high-quality collaborative behavior by asynchronously selecting and executing high-level actions (macro-actions). Existing multi-agent deep reinforcement learning methods focus mainly on synchronous, primitive-action problems, whereas robots in practice must independently select and complete high-level actions that take different amounts of time. The paper therefore proposes two Deep Q-Network (DQN) based methods, one for learning decentralized and one for learning centralized macro-action-value functions, and introduces Macro-Action Concurrent Experience Replay Trajectories (Mac-CERTs) and Macro-Action Joint Experience Replay Trajectories (Mac-JERTs) to support them.

### Specific Problems and Solutions

1. **Asynchronous Decision-making and Execution**:
   - **Problem**: Existing methods assume that all agents execute primitive actions synchronously, while in the real world robots in cooperative tasks often select and complete actions at different times.
   - **Solution**: Introduce macro-actions, which let agents select and terminate actions asynchronously and more naturally represent high-level control behaviors (such as navigating to a target point or grasping an object).
2. **Multi-agent Reinforcement Learning under Partial Observability**:
   - **Problem**: How to learn and make decisions effectively in a partially observable multi-agent environment.
   - **Solution**: Build on MacDec-POMDPs (Macro-Action Decentralized Partially Observable Markov Decision Processes), a general framework for asynchronous decision making under uncertainty, and propose DQN-based learning methods for this setting.
3. **Decentralized and Centralized Learning**:
   - **Problem**: How to learn effectively in both the decentralized and the centralized setting.
   - **Solution**:
     - **Decentralized Learning**: A DQN-based decentralized macro-action learning method that generates Mac-CERTs, so that each agent maintains its own macro-action trajectory.
     - **Centralized Learning**: A DQN-based centralized macro-action learning method that generates Mac-JERTs, which preserve the timing information of the agents' macro-action trajectories, together with a conditional target prediction method for learning the centralized joint macro-action-value function.

### Experimental Verification

The methods are evaluated on benchmark problems and a larger domain. The results show that learning with macro-actions outperforms learning with primitive actions and that the approaches scale to the larger domain.

### Summary

By introducing macro-actions and matching learning mechanisms, this paper addresses the learning of asynchronous collaborative behaviors in multi-robot systems, particularly in partially observable multi-agent environments, and provides a more efficient and flexible approach.
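The decentralized method relies on Mac-CERTs, in which every agent's experience is recorded at each primitive step so that the agents' trajectories stay aligned in time, while each agent learns only from the points where its own macro-action ended. The sketch below shows one way to compress such a per-step record into per-agent macro-action transitions. It is an illustration under assumptions, not the paper's code: the `Step`/`MacroTransition` layout, the shared team reward, and discounting the reward accumulated within a macro-action are all choices made here for clarity, and every name is hypothetical.

```python
# Minimal sketch (assumption): compressing concurrent per-step trajectories into
# per-agent macro-action transitions, in the spirit of Mac-CERTs.
# All class and function names are hypothetical.
from dataclasses import dataclass
from typing import Dict, Hashable, List


@dataclass
class Step:
    """One primitive time step recorded for one agent."""
    obs: Hashable        # agent's local observation when the step began
    macro_action: int    # macro-action the agent is executing
    reward: float        # shared team reward at this step
    next_obs: Hashable   # agent's local observation after the step
    terminated: bool     # True if the agent's macro-action ended at this step


@dataclass
class MacroTransition:
    obs: Hashable        # observation when the macro-action was selected
    macro_action: int
    cum_reward: float    # discounted reward accumulated while it ran
    next_obs: Hashable   # observation when it terminated
    duration: int        # tau: number of primitive steps it lasted


def compress_agent(steps: List[Step], gamma: float) -> List[MacroTransition]:
    """Keep one transition per completed macro-action of a single agent."""
    out: List[MacroTransition] = []
    start_obs, r_c, tau = None, 0.0, 0
    for s in steps:
        if tau == 0:  # a new macro-action starts at this step
            start_obs = s.obs
        r_c += (gamma ** tau) * s.reward
        tau += 1
        if s.terminated:
            out.append(MacroTransition(start_obs, s.macro_action, r_c, s.next_obs, tau))
            r_c, tau = 0.0, 0
    return out  # a macro-action still running at the end is dropped


def compress_concurrent(trajs: Dict[int, List[Step]],
                        gamma: float) -> Dict[int, List[MacroTransition]]:
    """Apply compress_agent to every agent's time-aligned trajectory."""
    return {agent: compress_agent(steps, gamma) for agent, steps in trajs.items()}


def make_example() -> Dict[int, List[Step]]:
    """Two agents over 4 steps; agent 0 finishes at t=1 and t=3, agent 1 at t=2."""
    rewards = [0.0, 1.0, 0.0, 2.0]
    plan = {0: [(0, False), (0, True), (1, False), (1, True)],
            1: [(5, False), (5, False), (5, True), (6, False)]}
    return {agent: [Step(f"o{agent}_{t}", m, rewards[t], f"o{agent}_{t + 1}", done)
                    for t, (m, done) in enumerate(seq)]
            for agent, seq in plan.items()}


if __name__ == "__main__":
    for agent, transitions in compress_concurrent(make_example(), gamma=0.5).items():
        for tr in transitions:
            print(agent, tr)
```

Storing the per-step record time-aligned across agents, and compressing per agent only afterward, keeps every agent's transitions tied to the same episode segment even though their macro-actions end at different times.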
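For the centralized method, Mac-JERTs store joint trajectories. A natural reading, assumed here rather than taken from the paper, is that a joint macro-action transition ends whenever any agent's macro-action terminates, and that the record keeps which agents terminated so the learner still knows the timing of each macro-action. The sketch below compresses a joint per-step record on that basis; `JointStep`, `JointTransition`, `compress_joint` and `make_example` are hypothetical names, and the discounted within-segment reward is likewise an assumption.

```python
# Minimal sketch (assumption): compressing a joint per-step trajectory into joint
# macro-action transitions, in the spirit of Mac-JERTs. Names are hypothetical.
from dataclasses import dataclass
from typing import Hashable, List, Tuple


@dataclass
class JointStep:
    obs: Hashable                  # joint observation when the step began
    joint_action: Tuple[int, ...]  # macro-action each agent is executing
    reward: float                  # shared team reward at this step
    next_obs: Hashable             # joint observation after the step
    terminated: Tuple[bool, ...]   # which agents' macro-actions ended at this step


@dataclass
class JointTransition:
    obs: Hashable
    joint_action: Tuple[int, ...]
    cum_reward: float              # discounted reward over the segment
    next_obs: Hashable
    duration: int                  # primitive steps until some agent terminated
    terminated: Tuple[bool, ...]   # termination mask at the end of the segment


def compress_joint(steps: List[JointStep], gamma: float) -> List[JointTransition]:
    """Emit one transition each time at least one agent's macro-action terminates."""
    out: List[JointTransition] = []
    start_obs, r_c, tau = None, 0.0, 0
    for s in steps:
        if tau == 0:
            start_obs = s.obs
        r_c += (gamma ** tau) * s.reward
        tau += 1
        if any(s.terminated):
            # No agent can switch macro-action inside a segment (a switch needs a
            # termination, which ends the segment), so this joint action is the
            # one chosen at the segment's start.
            out.append(JointTransition(start_obs, s.joint_action, r_c,
                                       s.next_obs, tau, s.terminated))
            r_c, tau = 0.0, 0
    return out


def make_example() -> List[JointStep]:
    """Two agents over 4 steps: agent 0 ends at t=1 and t=3, agent 1 at t=2."""
    rewards = [0.0, 1.0, 0.0, 2.0]
    actions = [(0, 5), (0, 5), (1, 5), (1, 6)]
    dones = [(False, False), (True, False), (False, True), (True, False)]
    return [JointStep(f"s{t}", actions[t], rewards[t], f"s{t + 1}", dones[t])
            for t in range(4)]


if __name__ == "__main__":
    for tr in compress_joint(make_example(), gamma=0.5):
        print(tr)
```

Compared with the per-agent compression used for decentralized learning, the joint record produces more, shorter transitions, because every agent's termination closes a segment for the whole team.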
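The conditional target prediction step for the centralized joint macro-action-value function can be illustrated as follows. The assumption here is that, when bootstrapping the target, agents whose macro-actions have not terminated are held to the macro-action they are still executing, so the maximization ranges only over the agents that are free to choose a new one. This is a sketch of that idea over a tabulated array of next-step joint Q-values with an assumed semi-MDP γ^τ discount; `conditional_max` and `conditional_target` are hypothetical names, and the paper's network-based version may differ in detail.

```python
# Minimal sketch (assumption): conditional target prediction for a centralized
# joint macro-action-value function. next_q holds Q(h', m') for every joint
# macro-action, with one array axis per agent. Names are hypothetical.
from typing import Sequence

import numpy as np


def conditional_max(next_q: np.ndarray, terminated: Sequence[bool],
                    ongoing: Sequence[int]) -> float:
    """Max over joint macro-actions, holding non-terminated agents to their ongoing macro-action."""
    if next_q.ndim != len(terminated):
        raise ValueError("next_q needs one axis per agent")
    # Free axis (slice) for agents that may pick a new macro-action,
    # fixed index for agents still executing their current one.
    index = tuple(slice(None) if done else ongoing[i]
                  for i, done in enumerate(terminated))
    return float(np.max(next_q[index]))


def conditional_target(cum_reward: float, duration: int, next_q: np.ndarray,
                       terminated: Sequence[bool], ongoing: Sequence[int],
                       gamma: float, episode_done: bool = False) -> float:
    """y = r_c + gamma**tau * conditional max of the next joint Q-values."""
    if episode_done:
        return cum_reward
    return cum_reward + (gamma ** duration) * conditional_max(next_q, terminated, ongoing)


if __name__ == "__main__":
    # Two agents, three macro-actions each: next_q[m0, m1].
    next_q = np.arange(9, dtype=float).reshape(3, 3)
    # Agent 0 terminated, agent 1 is still running macro-action 1:
    # only column 1 ([1, 4, 7]) is eligible, so the max is 7 rather than 8.
    print(conditional_max(next_q, terminated=(True, False), ongoing=(0, 1)))
    print(conditional_target(0.5, 2, next_q, (True, False), (0, 1), gamma=0.5))
```

Taking the unconditional maximum (8 in this toy array) would credit a joint macro-action that agent 1 cannot actually take at that point, since it is still committed to its current macro-action.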