On-line policy optimisation of spoken dialogue systems via live interaction with human subjects

Milica Gasic, Filip Jurcícek, Blaise Thomson, Kai Yu, Steve Young
DOI: https://doi.org/10.1109/ASRU.2011.6163950
2011-01-01
Abstract: Statistical dialogue models have required a large number of dialogues to optimise the dialogue policy, relying on the use of a simulated user. This results in a mismatch between training and live conditions, and in significant development costs for the simulator, thereby negating many of the claimed benefits of such models. Recent work on Gaussian process reinforcement learning has shown that learning can be substantially accelerated. This paper reports on an experiment to learn a policy for a real-world task directly from human interaction, using rewards provided by users. It shows that a usable policy can be learnt in just a few hundred dialogues without needing a user simulator, using a learning strategy that reduces the risk of taking bad actions. The paper also investigates adaptation behaviour when the system continues learning for several thousand dialogues and highlights the need for robustness to noisy rewards.