Abstract: This paper presents a novel algorithm for learning parameters in statistical dialogue systems which are modelled as Partially Observable Markov Decision Processes (POMDPs). The three main components of a POMDP dialogue manager are a dialogue model representing dialogue state information; a policy which selects the system's responses based on the inferred state; and a reward function which specifies the desired behaviour of the system. Ideally both the model parameters and the policy would be designed to maximise the reward function. However, whilst there are many techniques available for learning the optimal policy, there are no good ways of learning the optimal model parameters that scale to real-world dialogue systems. The Natural Belief-Critic (NBC) algorithm presented in this paper is a policy gradient method which offers a solution to this problem. Based on observed rewards, the algorithm estimates the natural gradient of the expected reward. The resulting gradient is then used to adapt the prior distribution of the dialogue model parameters. The algorithm is evaluated on a spoken dialogue system in the tourist information domain. The experiments show that model parameters estimated to maximise the reward function result in significantly improved performance compared to the baseline handcrafted parameters.
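
The loop described in the abstract (sample model parameters from a prior, observe rewards, estimate the natural gradient of the expected reward, adapt the prior) can be illustrated with a small toy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the prior is a Gaussian with fixed variance, the dialogue system is replaced by a hypothetical quadratic `toy_reward`, and the natural gradient is estimated by least-squares regression of rewards on score vectors, which is one standard way of obtaining it.

```python
import numpy as np


def natural_gradient_estimate(scores, rewards):
    """Regress rewards on score vectors plus a constant baseline.

    The regression weights on the scores estimate the natural gradient;
    the last coefficient is the baseline and is discarded.
    """
    design = np.hstack([scores, np.ones((scores.shape[0], 1))])
    solution, *_ = np.linalg.lstsq(design, rewards, rcond=None)
    return solution[:-1]


def toy_reward(theta):
    """Hypothetical stand-in for a dialogue reward, peaked at `target`."""
    target = np.array([1.0, -2.0])
    return -np.sum((theta - target) ** 2)


def adapt_prior_mean(mean, sigma=0.5, iterations=100, batch=64,
                     step=0.2, seed=0):
    """Adapt the mean of an assumed Gaussian prior over parameters."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float).copy()
    for _ in range(iterations):
        # Sample parameter vectors from the current prior.
        thetas = mean + sigma * rng.standard_normal((batch, mean.size))
        # Observe one reward per sampled parameter vector.
        rewards = np.array([toy_reward(t) for t in thetas])
        # Score function: gradient of log N(theta; mean, sigma^2 I) w.r.t. mean.
        scores = (thetas - mean) / sigma ** 2
        # Step the prior mean along the estimated natural gradient.
        mean += step * natural_gradient_estimate(scores, rewards)
    return mean


if __name__ == "__main__":
    print(adapt_prior_mean([0.0, 0.0]))
```

In this toy setting the prior mean drifts towards the reward maximum; in a real system each reward would come from a dialogue, and the prior family would be whatever the dialogue model's parameters require.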

Uncertainty Estimates for Efficient Neural Network-based Dialogue Policy Optimisation

Gaussian processes for fast policy optimisation of POMDP-based dialogue managers

Deep Reinforcement Learning for Dialogue Generation

Hyper-parameter Optimisation of Gaussian Process Reinforcement Learning for Statistical Dialogue Management

On-line policy optimisation of spoken dialogue systems via live interaction with human subjects

Gaussian Process Based Deep Dyna-Q Approach for Dialogue Policy Learning

Anti-Overestimation Dialogue Policy Learning for Task-Completion Dialogue System

Natural Belief-Critic: a Reinforcement Algorithm for Parameter Estimation in Statistical Spoken Dialogue Systems

Improving Interaction Quality Estimation with BiLSTMs and the Impact on Dialogue Policy Learning

What does the User Want? Information Gain for Hierarchical Dialogue Policy Optimisation

Uncertainty-Aware Policy Optimization: A Robust, Adaptive Trust Region Approach

Learning Dialogue Policy Efficiently Through Dyna Proximal Policy Optimization

Policy Networks with Two-Stage Training for Dialogue Systems

User Study of the Bayesian Update of Dialogue State Approach to Dialogue Management

The Uncertainty Bellman Equation and Exploration

Optimizing human-interpretable dialog management policy using Genetic Algorithm

Agent-Aware Dropout DQN for Safe and Efficient On-line Dialogue Policy Learning

Sample-efficient Actor-Critic Reinforcement Learning with Supervised Data for Dialogue Management

Adversarial learning of neural user simulators for dialogue policy optimisation

Statistical Methods for Building Robust Spoken Dialogue Systems in an Automobile

Few-Shot Structured Policy Learning for Multi-Domain and Multi-Task Dialogues