Decentralized Multi-Agent Policy Evaluation over Directed Graphs

Qifeng Lin, Qing Ling
DOI: https://doi.org/10.23919/ccc55666.2022.9902191
2022-01-01
Abstract: Policy evaluation is a central topic in multi-agent reinforcement learning (MARL), in which agents following a fixed joint policy cooperatively estimate the global expected cumulative discounted reward. We consider the case where the agents communicate with their neighbors through a decentralized, directed communication network. The communication links are directed because the agents have different wireless transmission powers and/or because channel uncertainties cause packet losses. Various temporal difference (TD) learning methods have been developed to solve the decentralized policy evaluation problem but, to the best of our knowledge, no existing work considers a directed communication network. In this paper, we propose a directed decentralized TD(0) algorithm, abbreviated as DDec-TD(0), to address this issue. As in the decentralized TD(0) algorithm that operates over an undirected graph, in each iteration of DDec-TD(0) each agent combines the local models of its neighbors by weighted averaging and then performs a local TD(0) gradient step. For a directed graph, however, the weight matrix is column stochastic rather than doubly stochastic as it is for an undirected graph, so directly applying decentralized TD(0) suffers from significant bias. Inspired by developments in decentralized optimization over directed graphs, we further introduce the Push-Sum strategy to eliminate this bias. Numerical experiments demonstrate the effectiveness of the proposed algorithm.
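The bias that a column-stochastic weight matrix introduces, and the way Push-Sum removes it, can be illustrated on plain averaging. The following is a minimal sketch and not the authors' DDec-TD(0): it leaves out the TD(0) gradient step entirely, and the three-agent graph, the matrix A, the initial values X0, and the function names mix and run are all assumptions made for illustration.

```python
# Hypothetical 3-agent directed graph (an assumption, not from the paper).
# A[i][j] is the weight agent j pushes to agent i. Every column sums to 1
# (column stochastic), but the rows do not, so A is not doubly stochastic.
A = [
    [1 / 3, 0.0, 1 / 2],
    [1 / 3, 1 / 2, 0.0],
    [1 / 3, 1 / 2, 1 / 2],
]

# Illustrative local values held by the agents; their true average is 3.0.
X0 = [1.0, 2.0, 6.0]


def mix(A, v):
    """One communication round: agent i sums the weighted values pushed to it."""
    n = len(v)
    return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]


def run(A, x0, iters=100):
    """Return (naive, push_sum) estimates held by each agent after `iters` rounds."""
    x = list(x0)
    w = [1.0] * len(x0)  # Push-Sum auxiliary weights, initialized to 1
    for _ in range(iters):
        x = mix(A, x)  # the same mixing is applied to the values ...
        w = mix(A, w)  # ... and to the auxiliary weights
    naive = x  # mixing alone: agents settle on different, biased values
    push_sum = [xi / wi for xi, wi in zip(x, w)]  # the ratio cancels the bias
    return naive, push_sum


if __name__ == "__main__":
    naive, push_sum = run(A, X0)
    print("naive:   ", naive)
    print("push-sum:", push_sum)
```

With this particular matrix, mixing alone drives each agent to a different multiple of the sum of the initial values, while the ratio of the mixed value to the mixed auxiliary weight approaches the true average at every agent. This is the kind of correction the abstract attributes to the Push-Sum strategy.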