Imperfect also Deserves Reward: Multi-Level and Sequential Reward Modeling for Better Dialog Management

Zhengxu Hou, Bang Liu, Ruihui Zhao, Zijing Ou, Yafei Liu, Xi Chen, Yefeng Zheng

DOI: https://doi.org/10.48550/arXiv.2104.04748

2021-04-10

Abstract: For task-oriented dialog systems, training a Reinforcement Learning (RL) based Dialog Management module suffers from low sample efficiency and slow convergence speed due to the sparse rewards in RL. To solve this problem, many strategies have been proposed to give proper rewards when training RL, but their rewards lack interpretability and cannot accurately estimate the distribution of state-action pairs in real dialogs. In this paper, we propose a multi-level reward modeling approach that factorizes a reward into a three-level hierarchy: domain, act, and slot. Based on inverse adversarial reinforcement learning, our designed reward model can provide more accurate and explainable reward signals for state-action pairs. Extensive evaluations show that our approach can be applied to a wide range of reinforcement learning-based dialog systems and significantly improves both the performance and the speed of convergence.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses the low sample efficiency and slow convergence that arise when training a Reinforcement Learning (RL) based dialogue management module for task-oriented dialogue systems. Both problems stem mainly from sparse reward signals: in conventional RL training, a reward is usually given only when the task is completed, so the policy receives little feedback on intermediate turns and learns slowly. Many existing reward design strategies do provide some reward signal, but these rewards lack interpretability and cannot accurately estimate the distribution of state-action pairs in real conversations. To overcome these challenges, the paper proposes a multi-level, sequential reward modeling method. It decomposes the reward into three levels (domain, act, and slot) and uses inverse adversarial reinforcement learning to build a reward model that provides more accurate and interpretable reward signals for state-action pairs. According to the authors' evaluations, the method both improves the performance of the dialogue system and significantly speeds up convergence.
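The idea of a three-level, sequential reward can be illustrated with a small Python sketch. This is not the authors' implementation. The names `LevelScores` and `sequential_reward` are hypothetical, the per-level scores are assumed to be numbers in [0, 1] (in the paper they would come from a learned reward model), and the gating rule, where a lower level only earns credit in proportion to the levels above it, is an assumption chosen to show how a partially correct action can still receive a reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelScores:
    """Hypothetical per-level scores for one state-action pair, each in [0, 1]."""
    domain: float
    act: float
    slot: float


def sequential_reward(scores: LevelScores) -> float:
    """Combine level scores so that each level is gated by the levels above it.

    Assumed rule (for illustration only):
        reward = domain + domain * act + domain * act * slot
    A wrong domain yields 0, a right domain with a wrong act yields
    partial credit, and a fully correct action yields the maximum of 3.
    """
    for value in (scores.domain, scores.act, scores.slot):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each level score must lie in [0, 1]")
    d, a, s = scores.domain, scores.act, scores.slot
    return d + d * a + d * a * s


if __name__ == "__main__":
    # Correct domain and act, wrong slot: partial credit instead of zero.
    print(sequential_reward(LevelScores(domain=1.0, act=1.0, slot=0.0)))
```

Under this assumed rule, an action that picks the right domain and act but the wrong slot still earns two thirds of the maximum reward, whereas a task-completion-only reward would give it nothing until the end of the dialogue.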

Rewarding What Matters: Step-by-Step Reinforcement Learning for Task-Oriented Dialogue

Guided Dialog Policy Learning: Reward Estimation for Multi-Domain Task-Oriented Dialog

Reward Mechanism Design for Deep Reinforcement Learning-Based Microgrid Energy Management

Enhancing End-to-End Multi-Task Dialogue Systems: A Study on Intrinsic Motivation Reinforcement Learning Algorithms for Improved Training and Adaptability

Reinforcement Learning From Imperfect Corrective Actions And Proxy Rewards

Refine and Imitate: Reducing Repetition and Inconsistency in Persuasion Dialogues via Reinforcement Learning and Human Demonstration

Reward Modeling Requires Automatic Adjustment Based on Data Quality

Semi-Supervised Dialogue Policy Learning Via Stochastic Reward Estimation

Reward Design with Language Models

Towards Understanding the Influence of Reward Margin on Preference Model Performance

Secrets of RLHF in Large Language Models Part II: Reward Modeling

An emotion-sensitive dialogue policy for task-oriented dialogue system

Text2Reward: Reward Shaping with Language Models for Reinforcement Learning

Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback

Multi-Objective Intrinsic Reward Learning for Conversational Recommender Systems

What does the User Want? Information Gain for Hierarchical Dialogue Policy Optimisation

Optimizing Language Models with Fair and Stable Reward Composition in Reinforcement Learning

Integrating Pretrained Language Model for Dialogue Policy Learning

Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning

An Asynchronous Updating Reinforcement Learning Framework for Task-oriented Dialog System