DTDE: A New Cooperative Multi-Agent Reinforcement Learning Framework
Guanghui Wen, Junjie Fu, Pengcheng Dai, Jialing Zhou
DOI: https://doi.org/10.1016/j.xinn.2021.100162
2021-01-01
The Innovation
Abstract: A significant body of work on reinforcement learning has focused on single-agent tasks, where the agent aims to learn a policy that maximizes the cumulative reward in a dynamic environment [1] (Sutton R.S., Barto A.G. Reinforcement Learning: An Introduction. The MIT Press, 2018). Over the past decades, quite a few single-agent reinforcement learning algorithms have been developed [1]. Yet it is increasingly recognized that single-agent reinforcement learning algorithms may fail to effectively handle large-scale optimization (decision) tasks with joint features. Within this context, cooperative multi-agent reinforcement learning (CMARL) algorithms have been proposed, in which the agents aim to complete the multi-agent learning goal cooperatively through information exchange between neighboring agents. In the past few years, CMARL algorithms have received increasing attention due to their broad applications in various fields, such as traffic signal control in intelligent transportation systems, energy management of smart grids, and coordination control of robot swarms. Compared with a single-agent reinforcement learning algorithm, which considers only a single agent's state-action space, the joint state-action space of a CMARL algorithm grows exponentially as the number of agents increases [2] (Tan, M. (1993). Multi-agent reinforcement learning: independent vs. cooperative agents. In Proc. 10th Int. Conf. Mach. Learn. 330-337). Therefore, CMARL encounters major challenges of algorithm complexity and scalability. Another challenge in designing an efficient CMARL algorithm is the partial observability of the environment, in which each agent has to make its individual decisions based on local observations.
Within the context of CMARL, independent Q-learning (IQL)-based algorithms, where each agent establishes a local Q-function from local state-action information, have been suggested and discussed in the literature [2]. The advantages of several IQL-based CMARL algorithms compared with independent MARL algorithms have also been examined [2]. Although IQL has good scalability, it sometimes cannot guarantee that the collection of individual optimal actions produced by the agents' local Q-functions is equivalent to the optimal joint action; that is, the individual-global-max (IGM) principle may not be satisfied. Motivated partly by this observation, a new kind of CMARL paradigm based on the centralized training with decentralized execution (CTDE) mechanism has recently attracted significant attention. In CTDE, the agents' policies are trained in a centralized manner with access to global information and executed in a decentralized way based only on local observations. Typical CTDE-based CMARL algorithms include value decomposition networks (VDN), among others [3] (Sunehag, P., Lever, G., Gruslys, A., et al. (2018). Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proc. 17th Int. Conf. Auton. Agents MultiAgent Syst. 2085-2087). The aforementioned results advanced our knowledge of how to design CTDE-based algorithms for coping with CMARL problems. However, most of the above-mentioned CTDE-based algorithms are primarily focused on solving CMARL problems in the absence of constraints on agents' actions, especially inherent coupling (joint) constraints.
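As a small illustration of the IGM principle, the following sketch checks whether the tuple of per-agent greedy actions is also a greedy joint action. The Q-values, the additive (VDN-style) mixing, and the coordination-style team value are hypothetical numbers and functions chosen for illustration; they are not taken from the cited works.

```python
import itertools
import numpy as np

def igm_holds(local_qs, q_tot):
    """Return True if the per-agent greedy actions form a greedy joint action."""
    # Greedy action of each agent, computed from its own local Q-values only.
    individual = tuple(int(np.argmax(q)) for q in local_qs)
    # Greedy joint value, found by enumerating the whole joint action space.
    joint_actions = itertools.product(*(range(len(q)) for q in local_qs))
    best_value = max(q_tot(a) for a in joint_actions)
    return bool(q_tot(individual) == best_value)

# Hypothetical local Q-values for two agents with three actions each.
local_qs = [np.array([1.0, 3.0, 2.0]), np.array([0.5, 0.2, 4.0])]

def additive(a):
    # Additive mixing (VDN-style): Q_tot is the sum of the local Q-values.
    return sum(q[ai] for q, ai in zip(local_qs, a))

def coordination(a):
    # Hypothetical team value that is not a sum of the local Q-values:
    # the team is rewarded only if both agents pick action 0.
    return 10.0 if a == (0, 0) else 0.0

additive_ok = igm_holds(local_qs, additive)
coordination_ok = igm_holds(local_qs, coordination)
```

With the additive team value the individual greedy actions recover the joint optimum, whereas with the coordination-style value they do not, which is the kind of mismatch independent learners can run into.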
Due to the inherent complexity of large-scale CMARL problems, the feasible actions of an individual agent are generally affected by those of the other agents. For example, when designing CMARL algorithms to address the economic power dispatch problem for a smart grid with multiple generating units, the generating units are typically modeled as agents whose feasible actions (corresponding to the power outputs) depend on each other, as the total power output should equal the power demand [4] (Dai P., Yu W., Wen G., et al. Distributed reinforcement learning algorithm for dynamic economic dispatch with unknown generation cost functions. IEEE Trans. Ind. Informat. 2020; 16: 2258-2267). The capability of addressing constraints on the actions of multiple agents has become an important index of the practicality of CMARL algorithms. In the context of distributed consensus of multi-agent systems (MASs), various distributed information exchange protocols have been constructed and embedded into the agents such that the states of all agents converge to a common value [5] (Ren W., Beard R.W. Distributed Consensus in Multi-Vehicle Cooperative Control. Springer-Verlag, 2008). During the consensus-seeking process, the agents equipped with distributed information exchange protocols share information with their neighbors through the underlying communication network. Based on the ideas of CTDE and distributed consensus of MASs, a distributed reinforcement learning framework called distributed training with decentralized execution (DTDE) is envisioned in this paper. The DTDE structure suggested in this work is shown in Figure 1.
Different from CTDE, where each individual agent can obtain the global state $s$ and the common (joint) reward $r$ directly, each agent $i$ in the present DTDE algorithm employs a “consensus protocol” to estimate the global state $s$ and the average reward of the agents using only local information. The global state and the average reward are defined according to the specific MARL task under consideration. For example, when CMARL algorithms are used to solve the economic power dispatch problem for a smart grid with multiple generating units (agents), where the CMARL task is to minimize the total generation cost of the agents while satisfying the power balance condition, the total power demand can be selected as the global state and the reciprocal of the average generation cost of all agents can be selected as the average reward [4]. For each agent $i$, the estimates of the global state $s$ and the average reward are denoted by $\hat{s}_i$ and $\hat{r}_i$, respectively, in Figure 1. Agents can find a feasible joint action that satisfies the action constraints through “distributed exploration” over the underlying communication network. Each agent $i$ constructs a local Q-value function, which is defined on the estimated global state $\hat{s}_i$ and the local action $a_i$. Based on the technique of distributed optimization, the values of $\arg\max_{a} Q_{tot}(\hat{s}, a; \theta)$ and $\max_{a'} Q_{tot}(\hat{s}', a'; \theta^-)$ under action constraints can be calculated in a distributed way through the underlying communication network, where $\hat{s}'$ is the consensus value of the next global state estimate produced by $s$ and $a$, and $\theta^- = (\theta_1^-, \dots, \theta_N^-)$ are the parameters of a target network, periodically copied from $\theta$ as in a deep Q-network.
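The “consensus protocol” is not specified in detail here. As a minimal sketch, assuming a standard discrete-time average-consensus iteration on an undirected, connected graph, the code below shows how each agent's local value can converge to the network-wide average using only neighbor information. The ring topology, the step size, and the local reward values are hypothetical.

```python
import numpy as np

def average_consensus(x0, adjacency, step=0.3, iters=200):
    """Iterate x_i <- x_i + step * sum_{j in N_i} (x_j - x_i) for all agents."""
    x = np.asarray(x0, dtype=float).copy()
    A = np.asarray(adjacency, dtype=float)
    degree = A.sum(axis=1)
    for _ in range(iters):
        # Each agent combines only its own value and its neighbors' values.
        x = x + step * (A @ x - degree * x)
    return x

# Hypothetical ring of four agents (undirected and connected).
adjacency = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
])
local_rewards = [1.0, 2.0, 3.0, 6.0]  # made-up local rewards r_i
estimates = average_consensus(local_rewards, adjacency)
```

In this sketch every entry of `estimates` approaches the average of the local rewards. The step size is an assumption that must be small enough for the chosen graph (here below 0.5, since the largest Laplacian eigenvalue of the four-node ring is 4); other consensus protocols could be substituted.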
The key steps of DTDE are as follows:
Step 1. Each agent $i$ obtains the observation $o_i$ from the environment and employs the local information $\hat{s}_i$ together with $\hat{s}_{N_i}$, which is obtained from the communication network, to estimate the global state $s$ through the “consensus protocol” in a distributed way. By the theory of distributed consensus, the estimates $\hat{s}_i$ generally reach a common value $\hat{s}$, which equals the global state $s$ provided that the underlying communication network is undirected and connected. In addition, each agent $i$ utilizes the information $\hat{r}_{N_i}$, which comes from the neighbors of agent $i$, to estimate the average reward.
Step 2. Agents employ the $\epsilon$-greedy policy in the training and execution process. In the exploration step, each agent $i$ utilizes the local information $\hat{s}_i$ and $a_{N_i}^{c}$, which is obtained from the communication network, to find a feasible joint action through “distributed exploration.” It should be noted that $a_{N_i}^{c}$ is information coming from the communication network, which is iteratively updated until the joint action $a = (a_1, \dots, a_N)$ is feasible. In the exploitation step, the agents choose $\arg\max_{a} Q_{tot}(\hat{s}, a; \theta)$ to execute.
Step 3. Each agent $i$ updates the local parameter $\theta_i$ by minimizing the temporal difference (TD) loss $L_{tot}(\theta)$, which is given by $L_{tot}(\theta) = \left( r + \gamma \max_{a'} Q_{tot}(\hat{s}', a'; \theta^-) - Q_{tot}(\hat{s}, a; \theta) \right)^2$, where $\theta^- = (\theta_1^-, \dots, \theta_N^-)$ are the parameters of a target network, periodically copied from $\theta$ as in a deep Q-network.
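The TD loss in Step 3 can be sketched for a single transition as follows. The function and argument names are hypothetical, the numbers are made up, and the constrained maximum over $a'$ is assumed to have been computed elsewhere and is passed in as a plain number.

```python
def td_loss(reward, gamma, max_next_q_target, q_current):
    """Squared TD error for one transition.

    max_next_q_target stands for max_{a'} Q_tot(s_hat', a'; theta^-),
    q_current stands for Q_tot(s_hat, a; theta).
    """
    # Bootstrapped target built from the target-network value.
    td_target = reward + gamma * max_next_q_target
    return (td_target - q_current) ** 2

# Made-up numbers for illustration only.
loss = td_loss(reward=1.0, gamma=0.9, max_next_q_target=5.0, q_current=4.0)
```

Here the target is 1.0 + 0.9 × 5.0 = 5.5, so the squared TD error is (5.5 − 4.0)² = 2.25. In a full implementation this quantity would be averaged over a batch and differentiated with respect to the parameters $\theta$.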
It should be noted that the value of $\max_{a'} Q_{tot}(\hat{s}', a'; \theta^-)$ in the presence of action constraints can be calculated by the technique of “distributed optimization.” Compared with the CTDE-based CMARL algorithms, the potential advantages of the DTDE-based CMARL algorithms include:
(1) As the agents can estimate the global state through distributed algorithms based on local observations, the common assumption made in executing CTDE-based CMARL algorithms, namely that the agents know the global state information, can be removed in the DTDE-based framework. Furthermore, the distributed training structure in the present DTDE-based framework could improve the robustness of the system. The requirement of privacy preservation may also be met by designing distributed privacy-preserving information exchange protocols in practical implementations.
(2) Different from the CTDE-based algorithms, which are commonly utilized to solve CMARL tasks in the absence of action constraints, the DTDE-based algorithms can deal with CMARL tasks in the presence of constraints on agents' actions. More specifically, the DTDE-based algorithms can calculate the value of $\max_{a'} Q_{tot}(\hat{s}', a'; \theta^-)$ in a distributed way under the joint action constraints (e.g., using the distributed optimization technique for MASs).
(3) Different from the decentralized execution in the CTDE-based CMARL algorithms, which relies on the IGM principle, the decentralized execution in the DTDE-based CMARL algorithms is realized by the distributed optimization technique. Therefore, the optimality of the learned policy can generally be preserved.
To harvest the advantages of DTDE-based CMARL algorithms, the key is to handle the partial observability constraints using distributed interaction. Individual agents should obtain a common estimate of the global state that reflects information about the whole environment and then make the best decision at each time step.
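To make the role of the joint action constraint concrete, the sketch below computes the constrained maximum of $Q_{tot}$ by centralized brute-force enumeration. It is only a stand-in for the distributed optimization step, which this article does not specify; the additive form of $Q_{tot}$, the discrete power levels, the Q-values, and the demand are all assumptions made for illustration.

```python
import itertools

def constrained_argmax(local_qs, levels, demand):
    """Best joint action among those whose outputs sum to the demand."""
    best_action, best_value = None, float("-inf")
    for joint in itertools.product(range(len(levels)), repeat=len(local_qs)):
        # Coupling constraint: total power output must equal the demand.
        if sum(levels[k] for k in joint) != demand:
            continue
        # Assumed additive team value: sum of the local Q-values.
        value = sum(q[k] for q, k in zip(local_qs, joint))
        if value > best_value:
            best_action, best_value = joint, value
    return best_action, best_value

levels = [0, 10, 20]        # hypothetical discrete power outputs
local_qs = [
    [0.0, 2.0, 3.0],        # made-up local Q-values of agent 1
    [0.0, 1.0, 5.0],        # agent 2
    [0.0, 4.0, 4.5],        # agent 3
]
action, value = constrained_argmax(local_qs, levels, demand=30)

# Per-agent greedy choice that ignores the coupling constraint.
unconstrained = tuple(max(range(len(levels)), key=lambda k: q[k]) for q in local_qs)
```

In this example the per-agent greedy choice selects the highest output for every unit and violates the power balance, while the constrained search returns a feasible joint action. Enumeration scales exponentially with the number of agents, which is why a distributed optimization method would be needed in practice.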
Theoretical guarantees for several typical DTDE-based CMARL algorithms can be established in the future to solidify the proposed CMARL framework.
The authors declare no competing interests.