Abstract: Reinforcement learning (RL) has emerged as a key technique for designing dialogue policies. However, action space inflation in dialogue tasks imposes a heavy decision burden on dialogue policies and leads to incoherent responses. In this paper, we propose a novel decomposed deep Q-network (D2Q) that exploits the natural structure of dialogue actions to decompose the Q-function, enabling efficient and coherent dialogue policy learning. Instead of estimating the Q-function directly, D2Q consists of two separate estimators, one for the abstract action-value function and the other for the specific action-value function, both sharing a common feature layer. The abstract action-value function determines the speech act of the system action, while the specific action-value function focuses on the concrete action. This structure establishes a logical relationship between the speech acts of the user and the system, avoiding incoherence. Moreover, the abstract action-value function shields unreasonable specific actions in the inflated action space, reducing decision complexity. Our results show that incoherence is prevalent in existing approaches and significantly impairs the efficiency and quality of dialogue policy learning. Our D2Q architecture alleviates this problem and performs significantly better than competitive baselines in both evaluated and human experiments. Further experiments validate the generality of our method, which can be easily extended to other RL-based dialogue policy approaches.
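
The two-estimator structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy action set, the layer sizes, the random weights, and the names `DecomposedQ`, `SPEECH_ACTS`, and `ACTIONS` are all hypothetical. In particular, the abstract does not say how the two value functions are combined or trained, so the greedy rule below (pick a speech act first, then restrict the concrete actions to that speech act) is only one plausible reading of "shielding".

```python
import random

# Hypothetical toy action space: every concrete action belongs to one speech act.
SPEECH_ACTS = ["request", "inform"]
ACTIONS = [("request", "date"), ("request", "city"),
           ("inform", "price"), ("inform", "time")]


def init_matrix(rows, cols, rng):
    """Small random weight matrix (illustrative initialisation only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class DecomposedQ:
    """Shared feature layer feeding an abstract head and a specific head."""

    def __init__(self, state_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.w_shared = init_matrix(hidden_dim, state_dim, rng)
        self.b_shared = [0.0] * hidden_dim
        self.w_abs = init_matrix(len(SPEECH_ACTS), hidden_dim, rng)
        self.b_abs = [0.0] * len(SPEECH_ACTS)
        self.w_spec = init_matrix(len(ACTIONS), hidden_dim, rng)
        self.b_spec = [0.0] * len(ACTIONS)

    def forward(self, state):
        # Common feature layer with ReLU, shared by both estimators.
        hidden = [max(0.0, v)
                  for v in linear(self.w_shared, self.b_shared, state)]
        q_abstract = linear(self.w_abs, self.b_abs, hidden)
        q_specific = linear(self.w_spec, self.b_spec, hidden)
        return q_abstract, q_specific

    def act(self, state):
        q_abstract, q_specific = self.forward(state)
        # Step 1: the abstract head chooses the speech act.
        act_idx = max(range(len(SPEECH_ACTS)), key=q_abstract.__getitem__)
        speech_act = SPEECH_ACTS[act_idx]
        # Step 2 (assumed masking rule): only concrete actions under the
        # chosen speech act remain candidates for the specific head.
        allowed = [i for i, (act, _) in enumerate(ACTIONS) if act == speech_act]
        best = max(allowed, key=q_specific.__getitem__)
        return speech_act, ACTIONS[best]
```

Calling `DecomposedQ(state_dim=3, hidden_dim=4).act([1.0, 0.0, -1.0])` returns a speech act together with a concrete action that always falls under it, which shows how restricting the candidates shrinks the set the specific head has to rank.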

[Figure: dialogue system diagram. Components: Natural Language Understanding, Dialogue State Tracking, Dialogue Policy Learning, Natural Language Generation, Discriminator. Labels: Semantic Frame, State Representation, System Action (Policy), Real Experience, Simulated Experience.]

Decomposed Deep Q-Network for Coherent Task-Oriented Dialogue Policy Learning
