Online learning for Markov decision processes applied to multi-agent systems

Mahmoud El Chamie, Behçet Açikmese, M. Mesbahi
DOI: https://doi.org/10.1109/CDC.2017.8263879
2017-12-01
Abstract: Online learning is the process of providing online control decisions in sequential decision-making problems given (possibly partial) knowledge about the optimal controls for the past decision epochs. The purpose of this paper is to apply online learning techniques to finite-state, finite-action Markov decision processes (finite MDPs). We consider a multi-agent system composed of a learning agent and observed agents. The learning agent observes from the other agents the state probability distribution (pd) resulting from a stationary policy, but not the policy itself. The state pd is observed either directly from an observed agent or through the density distribution of the multi-agent system. We show that, using online learning, the learned policy performs at least as well as that of the observed agents. Specifically, this paper shows that if the observed agents are running an optimal policy, the learning agent can learn the optimal average expected cost MDP policies via online learning techniques by using a gradient descent algorithm on the observed agents' pd data.
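To make the observed quantity concrete, the following is a minimal sketch of how a stationary policy on a finite MDP induces a state pd and an average expected cost. It is not the paper's learning algorithm. The function names, the toy two-state MDP, and the use of power iteration (which assumes the induced Markov chain is irreducible and aperiodic) are all assumptions made for illustration.

```python
def policy_transition_matrix(P, policy):
    """Markov chain induced by a stationary policy.

    P[a][s][t]   : probability of moving from state s to state t under action a
    policy[s][a] : probability of choosing action a in state s
    """
    n_states, n_actions = len(policy), len(P)
    return [
        [sum(policy[s][a] * P[a][s][t] for a in range(n_actions))
         for t in range(n_states)]
        for s in range(n_states)
    ]


def stationary_pd(M, iters=10_000, tol=1e-12):
    """Power iteration for the stationary state pd d satisfying d = d M.

    Assumes the chain M is irreducible and aperiodic, so the iteration converges.
    """
    n = len(M)
    d = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        nxt = [sum(d[s] * M[s][t] for s in range(n)) for t in range(n)]
        if max(abs(x - y) for x, y in zip(nxt, d)) < tol:
            return nxt
        d = nxt
    return d


def average_cost(d, policy, cost):
    """Average expected cost per step: sum over s, a of d[s] * policy[s][a] * cost[s][a]."""
    return sum(
        d[s] * policy[s][a] * cost[s][a]
        for s in range(len(d))
        for a in range(len(policy[s]))
    )


# Hypothetical toy MDP: 2 states, 2 actions (numbers chosen only for illustration).
P = [
    [[0.9, 0.1], [0.1, 0.9]],  # action 0
    [[0.5, 0.5], [0.5, 0.5]],  # action 1
]
policy = [[0.0, 1.0], [1.0, 0.0]]  # state 0 plays action 1, state 1 plays action 0
cost = [[1.0, 2.0], [0.0, 3.0]]    # cost[s][a]

M = policy_transition_matrix(P, policy)
d = stationary_pd(M)               # the state pd an observer would see
avg = average_cost(d, policy, cost)
```

In this toy case the induced chain is [[0.5, 0.5], [0.1, 0.9]], so the state pd is [1/6, 5/6]. In the setting of the abstract, the learning agent would see only `d`, not `policy`.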