Imperfect also Deserves Reward: Multi-Level and Sequential Reward Modeling for Better Dialog Management

Zhengxu Hou, Bang Liu, Ruihui Zhao, Zijing Ou, Yafei Liu, Xi Chen, Yefeng Zheng
DOI: https://doi.org/10.48550/arXiv.2104.04748
2021-04-10
Abstract: For task-oriented dialog systems, training a Reinforcement Learning (RL) based Dialog Management module suffers from low sample efficiency and slow convergence speed due to the sparse rewards in RL. To solve this problem, many strategies have been proposed to give proper rewards when training RL, but their rewards lack interpretability and cannot accurately estimate the distribution of state-action pairs in real dialogs. In this paper, we propose a multi-level reward modeling approach that factorizes a reward into a three-level hierarchy: domain, act, and slot. Based on inverse adversarial reinforcement learning, our designed reward model can provide more accurate and explainable reward signals for state-action pairs. Extensive evaluations show that our approach can be applied to a wide range of reinforcement learning-based dialog systems and significantly improves both the performance and the speed of convergence.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the low sample efficiency and slow convergence of Reinforcement Learning (RL) based dialog management modules in task-oriented dialog systems. Both problems stem mainly from sparse reward signals: in conventional RL training, a reward is usually given only when the task is completed, so the policy receives little feedback during a dialog and learns slowly. Existing reward design strategies do supply additional rewards, but those rewards lack interpretability and cannot accurately estimate the distribution of state-action pairs in real dialogs. To overcome these limitations, the paper proposes a multi-level, sequential reward modeling method. It factorizes the reward into a three-level hierarchy (domain, act, slot) and builds the reward model on inverse adversarial reinforcement learning, so that state-action pairs receive more accurate and interpretable reward signals. According to the authors, the method improves dialog system performance and significantly speeds up convergence.
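To make the three-level factorization concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that some learned model has already produced a score in [0, 1] for each level of a system action; the names `LevelScores`, `sequential_reward`, and `sparse_reward` are hypothetical, and the gating-by-product rule is only one plausible way to combine the levels sequentially. The point it illustrates is that a partially correct action (right domain, wrong act or slot) still earns a partial reward, whereas a sparse task-completion reward gives no signal until the dialog ends.

```python
from dataclasses import dataclass


@dataclass
class LevelScores:
    """Hypothetical per-level scores in [0, 1] for one state-action pair."""
    domain: float
    act: float
    slot: float


def sequential_reward(scores: LevelScores) -> float:
    """Combine level scores so a lower level only counts as far as the
    levels above it are judged correct (an assumed combination rule)."""
    for name in ("domain", "act", "slot"):
        value = getattr(scores, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")

    domain_part = scores.domain              # credit for the right domain
    act_part = domain_part * scores.act      # act credit gated by domain
    slot_part = act_part * scores.slot       # slot credit gated by act
    # Average the three parts so the result stays in [0, 1].
    return (domain_part + act_part + slot_part) / 3.0


def sparse_reward(task_succeeded: bool) -> float:
    """Baseline for comparison: reward only at task completion."""
    return 1.0 if task_succeeded else 0.0


if __name__ == "__main__":
    # Right domain and act, wrong slot: partial credit instead of zero.
    print(sequential_reward(LevelScores(domain=1.0, act=1.0, slot=0.0)))
    # Wrong domain: lower levels earn nothing, whatever their scores.
    print(sequential_reward(LevelScores(domain=0.0, act=1.0, slot=1.0)))
```

In this sketch each term of the sum can be inspected on its own, which is one way the per-level breakdown could support the interpretability the paper aims for.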