Belief State Actor-Critic Algorithm from Separation Principle for POMDP.

Yujie Yang, Yuxuan Jiang, Jianyu Chen, Shengbo Eben Li, Ziqing Gu, Yuming Yin, Qian Zhang, Kai Yu
DOI: https://doi.org/10.23919/acc55779.2023.10155792
2023-01-01
Abstract: The partially observable Markov decision process (POMDP) is a general framework for decision making and control under uncertainty. A large class of POMDP algorithms follows a two-step approach: the first step estimates the belief state, and the second step solves for the optimal policy with the belief state as input. The optimality guarantee of this combination relies on the so-called separation principle. In this paper, we propose a new path to prove the separation principle for general infinite-horizon POMDP problems under both discounted cost and average cost. We use a nominal horizon to split a virtual objective function into two parts and prove that it converges to the optimal state-value function. Based on the separation principle, we design a two-step POMDP algorithm called Belief State Actor-Critic (BSAC), which first estimates the belief state and then takes it as input to solve for the optimal policy. The belief state is learned using variational inference, and the policy is learned through model-based reinforcement learning. We test our algorithm on a partially observable multi-lane autonomous driving task. Results show that our algorithm achieves lower costs than the baselines and learns safe, efficient, and smooth driving behaviors.
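The two-step structure described in the abstract can be illustrated with a minimal sketch on a small discrete POMDP. This is not the BSAC algorithm: the paper learns the belief with variational inference and the policy with model-based reinforcement learning, whereas the sketch below swaps in an exact discrete Bayes filter for step one and a simple QMDP-style greedy rule for step two. The names `belief_update`, `greedy_action`, `T`, `O`, and `Q`, and all numeric values, are assumptions made for illustration only.

```python
import numpy as np


def belief_update(belief, action, observation, T, O):
    """Step 1 (illustrative): one exact Bayes-filter update of the belief.

    Assumed conventions (hypothetical, not from the paper):
      T[a, s, s2] = P(s2 | s, a)   transition model
      O[a, s2, o] = P(o | s2, a)   observation model
    """
    predicted = belief @ T[action]                     # prediction over next states
    unnormalized = predicted * O[action][:, observation]  # weight by observation likelihood
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / total


def greedy_action(belief, Q):
    """Step 2 (illustrative): choose the action with the lowest expected cost.

    Q[s, a] is an assumed per-state action cost table; averaging it under
    the belief is a QMDP-style stand-in for a learned belief-state policy.
    """
    return int(np.argmin(belief @ Q))


# Toy problem: two static states, two actions, noisy observations.
T = np.stack([np.eye(2), np.eye(2)])           # state never changes
O = np.stack([np.array([[0.8, 0.2],
                        [0.2, 0.8]])] * 2)     # observation matches state w.p. 0.8
Q = np.array([[0.0, 1.0],
              [1.0, 0.0]])                     # action i is free in state i

belief = np.array([0.5, 0.5])
belief = belief_update(belief, action=0, observation=0, T=T, O=O)
action = greedy_action(belief, Q)
```

In this toy setting, the belief summarizes the observation history, and the policy reads only the belief. That dependence is what the separation principle is meant to justify for the two-step approach.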