Macro-Action-Based Multi-Agent/Robot Deep Reinforcement Learning under Partial Observability

Yuchen Xiao
DOI: https://doi.org/10.48550/arXiv.2209.10003
2022-10-11
Abstract: The state-of-the-art multi-agent reinforcement learning (MARL) methods have provided promising solutions to a variety of complex problems. Yet, these methods all assume that agents perform synchronized primitive-action executions so that they are not genuinely scalable to long-horizon real-world multi-agent/robot tasks that inherently require agents/robots to asynchronously reason about high-level action selection at varying time durations. The Macro-Action Decentralized Partially Observable Markov Decision Process (MacDec-POMDP) is a general formalization for asynchronous decision-making under uncertainty in fully cooperative multi-agent tasks. In this thesis, we first propose a group of value-based RL approaches for MacDec-POMDPs, where agents are allowed to perform asynchronous learning and decision-making with macro-action-value functions in three paradigms: decentralized learning and control, centralized learning and control, and centralized training for decentralized execution (CTDE). Building on the above work, we formulate a set of macro-action-based policy gradient algorithms under the three training paradigms, where agents are allowed to directly optimize their parameterized policies in an asynchronous manner. We evaluate our methods both in simulation and on real robots over a variety of realistic domains. Empirical results demonstrate the superiority of our approaches in large multi-agent problems and validate the effectiveness of our algorithms for learning high-quality and asynchronous solutions with macro-actions.
Artificial Intelligence, Multiagent Systems, Robotics
What problem does this paper attempt to address?
This thesis addresses a limitation of existing reinforcement learning methods on large-scale, long-horizon multi-agent tasks in the real world. Most multi-agent reinforcement learning (MARL) methods assume that agents execute primitive actions synchronously at every time step, which makes them hard to extend to real-world multi-agent/robot tasks in which agents must make high-level decisions asynchronously.

The thesis tackles this with macro-actions: temporally extended actions that represent high-level control strategies, such as navigating to a waypoint or grasping an object. Because each agent's macro-action can start and end at a different time step, agents make decisions asynchronously.

The main contribution is a set of macro-action-based deep reinforcement learning methods that support asynchronous learning and decision-making for multi-agent/robot tasks in partially observable environments:

1. **Macro-action-based value methods**: value-based approaches for three paradigms: decentralized learning and control, centralized learning and control, and centralized training for decentralized execution (CTDE).
2. **Macro-action-based policy gradient algorithms**: policy gradient algorithms under the same three training paradigms, which let agents directly optimize their parameterized policies in an asynchronous manner.
3. **Experimental verification**: experiments in simulation and on real robots across a variety of realistic domains, which show that the proposed methods outperform alternatives in large multi-agent problems and can learn high-quality asynchronous solutions.

Together, these methods aim to provide a more flexible, efficient, and robust framework for complex multi-agent/robot cooperation tasks in the real world.
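The asynchronous decision-making described above can be illustrated with a toy sketch. It is not the thesis's algorithm: the names `MacroAction`, `Agent`, and `run`, the fixed durations, and the scripted `plan` (standing in for a learned policy) are all assumptions made for illustration. The only point it shows is that each agent selects a new macro-action when its own previous one terminates, so decision times differ across agents.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MacroAction:
    name: str
    duration: int  # assumed fixed number of primitive steps before termination


@dataclass
class Agent:
    name: str
    plan: List[MacroAction]  # scripted sequence; a stand-in for a learned policy
    current: Optional[MacroAction] = None
    remaining: int = 0  # primitive steps left in the current macro-action
    next_idx: int = 0


def run(agents: List[Agent], horizon: int) -> List[Tuple[int, str, str]]:
    """Step all agents and record (time step, agent, macro-action) decisions."""
    decisions = []
    for t in range(horizon):
        # An agent decides only when its own macro-action has terminated.
        for agent in agents:
            if agent.remaining == 0 and agent.next_idx < len(agent.plan):
                agent.current = agent.plan[agent.next_idx]
                agent.remaining = agent.current.duration
                agent.next_idx += 1
                decisions.append((t, agent.name, agent.current.name))
        # Every agent with an active macro-action executes one primitive step.
        for agent in agents:
            if agent.remaining > 0:
                agent.remaining -= 1
    return decisions


agents = [
    Agent("robot_a", [MacroAction("go_to_table", 3), MacroAction("grasp_tool", 2)]),
    Agent("robot_b", [MacroAction("go_to_shelf", 5), MacroAction("hand_over", 1)]),
]
for event in run(agents, horizon=8):
    print(event)
```

In this sketch `robot_a` makes its second decision at step 3 while `robot_b` is still executing its first macro-action and does not decide again until step 5. A synchronous primitive-action formulation would instead require both agents to choose an action at every step.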