Learning Dialogue Policy Efficiently Through Dyna Proximal Policy Optimization.

Chenping Huang, Bin Cao
DOI: https://doi.org/10.1007/978-3-031-24383-7_22
2022-01-01
Abstract: In recent years, many methods have been proposed that use reinforcement learning to train dialogue policies for task-oriented dialogue systems. However, the high cost of interacting with users has seriously hindered progress in this field. To reduce this interaction cost, the Deep Dyna-Q (DDQ) algorithm and several of its variants introduce a world model that simulates the user's responses, and then use the generated simulated dialogue data to train the dialogue policy. Nevertheless, these methods suffer from two main issues. First, training efficiency is limited by the underlying Deep Q-Network. Second, low-quality simulated dialogue data generated by the world model may hurt the performance of the dialogue policy. To address these drawbacks, we propose the Dyna Proximal Policy Optimization (DPPO) algorithm. DPPO combines the Proximal Policy Optimization (PPO) algorithm with a world model and uses a deactivation strategy to decide when to stop using the world model in subsequent training. We conducted experiments on a movie-ticket booking task. The results show that DPPO combines the advantages of DDQ and PPO: it significantly reduces the interaction cost required during training and achieves a higher task success rate.
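The two ingredients named in the abstract can be illustrated with a minimal Python sketch. The first function is the standard PPO clipped surrogate objective for a single sample. The second is a purely hypothetical deactivation rule: the abstract does not state the criterion DPPO uses, so the success-rate threshold, the function names, and the parameters below are assumptions made for illustration only, not the authors' method.

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample (to be maximized).

    ratio     -- new-policy probability divided by old-policy probability
    advantage -- estimated advantage of the action taken
    eps       -- clipping range (0.2 is a common default, assumed here)
    """
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def plan_world_model_usage(success_rates, threshold=0.5):
    """Return, for each epoch, whether simulated dialogues are used.

    ASSUMPTION: the world model is switched off permanently the first
    time the success rate reaches `threshold`. This rule is hypothetical;
    the paper's actual deactivation criterion is not given in the abstract.
    """
    use_world_model = True
    plan = []
    for rate in success_rates:
        if use_world_model and rate >= threshold:
            use_world_model = False  # deactivate and never re-enable
        plan.append(use_world_model)
    return plan
```

In this sketch, epochs where the plan is True would train on both user dialogues and world-model dialogues, while later epochs would train on user dialogues only, so that low-quality simulated data no longer influences the policy.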