End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning

Jason D. Williams, Geoffrey Zweig
DOI: https://doi.org/10.48550/arXiv.1606.01269
2016-06-04
Abstract: This paper presents a model for end-to-end learning of task-oriented dialog systems. The main component of the model is a recurrent neural network (an LSTM), which maps from raw dialog history directly to a distribution over system actions. The LSTM automatically infers a representation of dialog history, which relieves the system developer of much of the manual feature engineering of dialog state. In addition, the developer can provide software that expresses business rules and provides access to programmatic APIs, enabling the LSTM to take actions in the real world on behalf of the user. The LSTM can be optimized using supervised learning (SL), where a domain expert provides example dialogs which the LSTM should imitate; or using reinforcement learning (RL), where the system improves by interacting directly with end users. Experiments show that SL and RL are complementary: SL alone can derive a reasonable initial policy from a small number of training dialogs; and starting RL optimization with a policy trained with SL substantially accelerates the learning rate of RL.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses end-to-end learning for task-oriented dialogue systems. Specifically, the authors propose a model based on a Long Short-Term Memory (LSTM) network that maps directly from the raw dialogue history to a probability distribution over system actions, reducing the burden of dialogue-state feature engineering on system developers. The model is optimized with supervised learning (SL) and reinforcement learning (RL) to improve the performance of the dialogue system.

### Main Contributions

1. **End-to-End Learning**: The paper proposes an end-to-end dialogue control system that generates system actions directly from the user's input sequence, without manually designed dialogue state representations.
2. **Automatic State Representation**: The LSTM automatically infers a representation of the dialogue history, reducing the amount of manual feature engineering.
3. **Combination of Supervised Learning and Reinforcement Learning**: The model can quickly obtain a reasonable initial policy through supervised learning and then refine it through reinforcement learning, which accelerates the learning process.
4. **Practical Application**: The model can produce text actions and can also call programmatic APIs to carry out real-world operations, such as making phone calls or booking restaurants.

### Method Overview

- **Model Structure**:
  - **LSTM**: The main component, responsible for extracting features from the dialogue history and selecting an appropriate action.
  - **Domain-Specific Software**: Provides business rules and logic, as well as an interface to any API.
  - **Language Understanding Module**: Extracts entities and parses user input.
- **Training Method**:
  - **Supervised Learning**: The LSTM is trained on example dialogues provided by experts and learns to imitate them.
  - **Reinforcement Learning**: The LSTM is optimized through interaction with users, using dialogue success as the training signal, which further improves the system's performance.

### Experimental Results

- **Supervised Learning**:
  - Trained on a small number of dialogues (1 to 20), the LSTM achieves high prediction accuracy on new dialogues.
  - Cross-validation experiments verify how well the model generalizes with different amounts of training data.
- **Reinforcement Learning**:
  - Optimization with a policy gradient method improves the system's task completion rate.
  - Starting reinforcement learning from a policy trained with supervised learning significantly accelerates learning and improves the overall performance of the system.

### Conclusion

The LSTM-based end-to-end dialogue control system proposed in this paper learns and optimizes efficiently in task-oriented dialogue settings by combining supervised learning and reinforcement learning. The method simplifies the representation of dialogue state and improves the performance and adaptability of the system.
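
The two training signals described above can be illustrated with a toy sketch. It is not the authors' implementation: the LSTM is replaced by a fixed vector of action scores (`logits`), the business rules are assumed to be expressed as a 0/1 `mask` over actions, and the reward of 1.0 for a successful dialogue is an assumption chosen for illustration. All function names are hypothetical. The sketch shows only how a supervised (cross-entropy) step and a policy gradient (REINFORCE-style) step each shift a softmax policy.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_policy(logits, mask):
    """Drop actions the (assumed) business rules forbid (mask value 0) and renormalize."""
    weighted = [p * m for p, m in zip(softmax(logits), mask)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("mask forbids every action")
    return [w / total for w in weighted]

def supervised_step(logits, mask, expert_action, lr=0.5):
    """One gradient-descent step on cross-entropy toward the expert's action."""
    probs = masked_policy(logits, mask)
    return [
        z - lr * (p - (1.0 if i == expert_action else 0.0))
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

def policy_gradient_step(logits, mask, action, dialog_return, baseline=0.0, lr=0.5):
    """One REINFORCE-style step: shift the log-probability of `action` by its advantage."""
    probs = masked_policy(logits, mask)
    advantage = dialog_return - baseline
    return [
        z + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

# Toy usage: three hypothetical actions, the third forbidden by a business rule.
logits = [0.0, 0.0, 0.0]
mask = [1, 1, 0]
before = masked_policy(logits, mask)

# Supervised learning: imitate one expert turn that chose action 0.
logits = supervised_step(logits, mask, expert_action=0)
after_sl = masked_policy(logits, mask)

# Reinforcement learning: action 0 was taken in a dialogue that succeeded.
logits = policy_gradient_step(logits, mask, action=0, dialog_return=1.0)
after_rl = masked_policy(logits, mask)
```

In this toy setting the probability of action 0 rises after the supervised step and again after the rewarded policy gradient step, while the masked action keeps probability zero throughout. Starting the policy gradient updates from the supervised result, rather than from the uniform initial scores, mirrors the SL-then-RL ordering the summary describes.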