Deep Recurrent Policy Networks for Planning under Partial Observability.

Zixuan Chen, Zongzhang Zhang
DOI: https://doi.org/10.1007/978-3-030-30487-4_46
2019-01-01
Abstract: QMDP-net is a recurrent network architecture for planning under partial observability that combines features of model-free learning and model-based planning. The architecture represents a policy by connecting a partially observable Markov decision process (POMDP) model with the QMDP algorithm, which applies value iteration to the POMDP model. However, because the value iteration used in QMDP iterates over the entire state space, it may suffer from the “curse of dimensionality”. Moreover, because QMDP-based policies do not take actions to gain information, they can perform poorly in domains where information gathering is necessary. To address these two issues, this paper introduces two deep recurrent policy networks built on the plain QMDP-net: asynchronous QMDP-net and ReplicatedQ-net. The former incorporates asynchronous updates into the value iteration process of QMDP to learn a smaller abstract state space representation for planning. The latter partially replaces QMDP with the replicated Q-learning algorithm so that the policy can take informative actions. Experimental results show that the proposed networks outperform the plain QMDP-net on simulated robotic tasks.
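To make the two limitations named in the abstract concrete, here is a minimal sketch of the textbook tabular QMDP approximation. It is not the network from the paper. The function names, the toy two-state model, and the discount factor are all assumptions made for illustration. The inner loops visit every state on every sweep, which is the cost the abstract calls the “curse of dimensionality”, and the action score is a belief-weighted average of fully observable Q-values, so no action is ever preferred merely because it would reduce uncertainty.

```python
def value_iteration(T, R, gamma=0.95, iters=200):
    """Synchronous value iteration on the underlying MDP.

    T[a][s][s2] is the transition probability, R[a][s] the reward.
    Returns Q[s][a]. All names are illustrative, not from the paper.
    """
    n_actions, n_states = len(T), len(T[0])
    V = [0.0] * n_states
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        # Every sweep touches every state: cost grows with the state space.
        for s in range(n_states):
            for a in range(n_actions):
                Q[s][a] = R[a][s] + gamma * sum(
                    T[a][s][s2] * V[s2] for s2 in range(n_states)
                )
        # V is refreshed only after the full sweep (synchronous update).
        V = [max(q_row) for q_row in Q]
    return Q


def qmdp_action(belief, Q):
    """Pick the action maximizing the belief-weighted Q-value."""
    n_actions = len(Q[0])
    scores = [
        sum(b * Q[s][a] for s, b in enumerate(belief))
        for a in range(n_actions)
    ]
    best = max(range(n_actions), key=scores.__getitem__)
    return best, scores


# Hypothetical toy model: action 0 stays, action 1 swaps the two states;
# staying in state 1 yields reward 1, everything else yields 0.
T = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: swap
]
R = [
    [0.0, 1.0],  # action 0
    [0.0, 0.0],  # action 1
]
Q = value_iteration(T, R, gamma=0.5)
action_uncertain, scores_uncertain = qmdp_action([0.5, 0.5], Q)
action_in_state0, _ = qmdp_action([1.0, 0.0], Q)
```

Because the score is linear in the belief, the choice under the uniform belief is just an average of what would be best in each state. An asynchronous variant would update only a subset of states per sweep instead of all of them.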