Abstract: State-of-the-art multi-agent reinforcement learning (MARL) methods have provided promising solutions to a variety of complex problems. Yet these methods all assume that agents perform synchronized primitive-action executions, so they are not genuinely scalable to long-horizon real-world multi-agent/robot tasks that inherently require agents/robots to asynchronously reason about high-level action selection at varying time durations. The Macro-Action Decentralized Partially Observable Markov Decision Process (MacDec-POMDP) is a general formalization for asynchronous decision-making under uncertainty in fully cooperative multi-agent tasks. In this thesis, we first propose a group of value-based RL approaches for MacDec-POMDPs, where agents are allowed to perform asynchronous learning and decision-making with macro-action-value functions in three paradigms: decentralized learning and control, centralized learning and control, and centralized training for decentralized execution (CTDE). Building on this work, we formulate a set of macro-action-based policy gradient algorithms under the three training paradigms, where agents are allowed to directly optimize their parameterized policies in an asynchronous manner. We evaluate our methods both in simulation and on real robots over a variety of realistic domains. Empirical results demonstrate the superiority of our approaches in large multi-agent problems and validate the effectiveness of our algorithms for learning high-quality and asynchronous solutions with macro-actions.

What problem does this paper attempt to address?

This paper addresses the limitations of existing reinforcement learning methods on large-scale, long-horizon multi-agent tasks in the real world. Most existing multi-agent reinforcement learning (MARL) methods assume that agents execute primitive actions synchronously at every time step, which makes them hard to extend to real-world multi-agent/robot tasks that require agents to make high-level decisions asynchronously. The paper tackles this by introducing macro-actions: actions that represent high-level control strategies, such as navigating to a certain point or grasping an object. Because each agent can start and end its own macro-actions at different time steps, decision-making becomes asynchronous.

The main contributions are several macro-action-based deep reinforcement learning methods that achieve effective asynchronous learning and decision-making for multi-agent/robot tasks in partially observable environments:

1. **Macro-action-based value methods**: Value-based methods adapted to macro-actions are proposed for three paradigms: decentralized learning and control, centralized learning and control, and centralized training for decentralized execution (CTDE).
2. **Macro-action-based policy gradient algorithms**: Under the same three training paradigms, macro-action policy gradient algorithms are developed that let agents directly optimize their parameterized policies in an asynchronous manner.
3. **Experimental verification**: Experiments in simulation and on real robots across a variety of realistic domains demonstrate the superiority and effectiveness of the proposed methods in large multi-agent problems, especially their ability to learn high-quality asynchronous solutions.

Through these methods, the paper aims to provide a more flexible, efficient and robust framework for solving complex multi-agent/robot cooperation tasks in the real world.
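
To make the idea of asynchronous macro-action execution concrete, the following is a minimal Python sketch, not the thesis's implementation. It assumes each macro-action lasts a fixed number of primitive steps (in a MacDec-POMDP, termination is generally stochastic) and replaces a learned policy with a fixed plan. All names here (`MacroAction`, `Agent`, `run`, `make_demo_agents`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MacroAction:
    name: str
    duration: int  # assumption: fixed number of primitive steps until termination


@dataclass
class Agent:
    name: str
    plan: List[MacroAction]  # assumption: fixed plan standing in for a learned policy
    remaining: int = 0       # primitive steps left in the current macro-action
    next_idx: int = 0        # index of the next macro-action in the plan


def run(agents: List[Agent], horizon: int) -> Dict[str, List[Tuple[int, str]]]:
    """Step all agents in lockstep at the primitive level and record the
    time steps at which each agent selects a new macro-action."""
    decisions: Dict[str, List[Tuple[int, str]]] = {a.name: [] for a in agents}
    for t in range(horizon):
        for agent in agents:
            # An agent decides only when its own macro-action has terminated,
            # independently of what the other agents are doing.
            if agent.remaining == 0 and agent.next_idx < len(agent.plan):
                macro = agent.plan[agent.next_idx]
                agent.next_idx += 1
                agent.remaining = macro.duration
                decisions[agent.name].append((t, macro.name))
            # Execute one primitive step of the ongoing macro-action, if any.
            if agent.remaining > 0:
                agent.remaining -= 1
    return decisions


def make_demo_agents() -> List[Agent]:
    return [
        Agent("robot_a", [MacroAction("navigate", 3), MacroAction("grasp", 1)]),
        Agent("robot_b", [MacroAction("navigate", 2), MacroAction("navigate", 2)]),
    ]


decisions = run(make_demo_agents(), horizon=6)
for agent_name, log in decisions.items():
    print(agent_name, log)
```

In this sketch the two agents both decide at step 0, after which `robot_b` decides again at step 2 and `robot_a` at step 3. Those misaligned decision points are what a synchronized primitive-action formulation cannot express directly.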

Macro-Action-Based Multi-Agent/Robot Deep Reinforcement Learning under Partial Observability

Macro-Action-Based Deep Multi-Agent Reinforcement Learning

Observer-Based Multiagent Deep Reinforcement Learning: A Fully Distributed Training Scheme

Multi-Agent/Robot Deep Reinforcement Learning with Macro-Actions (Student Abstract)

Multiagent Reinforcement Learning for Strictly Constrained Tasks Based on Reward Recorder

Decentralized Multi-Agent Reinforcement Learning: An Off-Policy Method

Learning for Decentralized Control of Multiagent Systems in Large, Partially-Observable Stochastic Environments

Effective Multi-Agent Deep Reinforcement Learning Control with Relative Entropy Regularization

R-MADDPG for Partially Observable Environments and Limited Communication

Decentralized Control of Partially Observable Markov Decision Processes using Belief Space Macro-actions

Decentralized control of multi-robot partially observable Markov decision processes using belief space macro-actions

Less Is More: Robust Robot Learning via Partially Observable Multi-Agent Reinforcement Learning

Optimal Exploration Algorithm of Multi-Agent Reinforcement Learning Methods (Student Abstract)

More Centralized Training, Still Decentralized Execution: Multi-Agent Conditional Policy Factorization

Multi-agent Continual Coordination Via Progressive Task Contextualization

Decentralized Multi-agent Reinforcement Learning with Multi-time Scale of Decision Epochs

Multiagent Continual Coordination via Progressive Task Contextualization

Optimal Decision-Making in Mixed-Agent Partially Observable Stochastic Environments via Reinforcement Learning

Deep Reinforcement Learning With Macro-Actions

Multi-Agent Concentrative Coordination with Decentralized Task Representation

Multi-Agent Reinforcement Learning With Decentralized Distribution Correction