Centralized Optimization for Dec-POMDPs under the Expected Average Reward Criterion

Xiaofeng Jiang, Xiaodong Wang, Hongsheng Xi, Falin Liu
DOI: https://doi.org/10.1109/tac.2017.2702203
IF: 6.549
2017-01-01
IEEE Transactions on Automatic Control
Abstract: In this paper, decentralized partially observable Markov decision process (Dec-POMDP) systems with discrete state and action spaces are studied from a gradient point of view. Dec-POMDPs have recently emerged as a promising approach to optimizing multiagent decision making in partially observable stochastic environments. However, the decentralized nature of the Dec-POMDP framework results in the lack of a shared belief state, which makes it impossible for a decision maker to estimate the system state from local information. In contrast to belief-based policies, this paper focuses on optimizing the decentralized observation-based policy, which is easy to apply and does not suffer from the sharing problem. By analyzing the gradient of the objective function, we develop a centralized stochastic gradient policy iteration algorithm that finds the optimal policy on the basis of gradient estimates from a single sample path. This algorithm does not need any specific assumption and can be applied to most practical Dec-POMDP problems. One numerical example is provided to demonstrate the effectiveness of the algorithm.
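As an illustration of the terms used in the abstract, the following minimal Python sketch estimates the expected average reward of a decentralized observation-based policy from a single sample path. It is not the paper's algorithm and performs no gradient estimation; the two-state, two-agent toy model, the 0.8 observation accuracy, the state-flip probabilities, and the name simulate_average_reward are all assumptions made up for this example.

```python
import random


def simulate_average_reward(policies, steps=20000, seed=0):
    """Estimate the long-run average reward along one sample path.

    policies[i][o] is the probability that agent i picks action 1
    after seeing its own private observation o (hypothetical toy model).
    """
    rng = random.Random(seed)
    state = 0
    total = 0.0
    for _ in range(steps):
        actions = []
        for pol in policies:
            # Each agent gets a private noisy observation of the state
            # (correct with probability 0.8) and acts on it alone;
            # no belief state is shared between agents.
            obs = state if rng.random() < 0.8 else 1 - state
            actions.append(1 if rng.random() < pol[obs] else 0)
        # Team reward: 1 only when every agent's action matches the state.
        reward = 1.0 if all(a == state for a in actions) else 0.0
        total += reward
        # Assumed dynamics: the state flips less often after a success.
        flip_prob = 0.3 if reward == 1.0 else 0.5
        if rng.random() < flip_prob:
            state = 1 - state
    return total / steps


# Two agents that both follow their own observation, versus two agents
# that ignore observations and act uniformly at random.
follow_observation = [[0.0, 1.0], [0.0, 1.0]]
uniform_random = [[0.5, 0.5], [0.5, 0.5]]
print(simulate_average_reward(follow_observation))
print(simulate_average_reward(uniform_random))
```

In this toy model the observation-following policy earns a reward with probability 0.8 × 0.8 = 0.64 at each step and the uniform policy with probability 0.5 × 0.5 = 0.25, so the two sample-path averages should settle near those values. A gradient-based method such as the one the abstract describes would adjust the entries of policies to increase this average.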