A multimodal dialogue system for improving user satisfaction via knowledge-enriched response and image recommendation

Wang, Jiangnan
DOI: https://doi.org/10.1007/s00521-023-08409-z
2023-03-11
Neural Computing and Applications
Abstract: Task-oriented multimodal dialogue systems have important application value and development prospects. Existing methods have made significant progress, but the following challenges remain: (1) Most existing methods focus on improving the accuracy of dialogue state tracking and dialogue act prediction. However, the need to leverage knowledge from the knowledge base to supplement textual responses in multi-turn dialogues is ignored. (2) One feature that distinguishes multimodal dialogue from plain-text dialogue is the use of visual information. However, existing methods ignore the importance of accurately providing visual information to improve user satisfaction. (3) For multimodal dialogue systems, most existing methods ignore the classification of response types needed to assign appropriate response generators automatically. To address these issues, we present a user-satisfactory multimodal dialogue system, USMD for short. Specifically, USMD consists of four modules. The general response generator is based on Generative Pre-trained Transformer 2 (GPT-2) and generates dialogue acts and general textual responses. The knowledge-enriched response generator is designed to leverage a structured knowledge base under the guidance of a knowledge graph. The image recommender attends to both latent and explicit visual cues, using a deep multimodal fusion model to obtain informative image representations. Finally, the response classifier automatically selects the appropriate generators to answer the user based on user and agent actions. Extensive experiments on the benchmark multimodal dialogue datasets show that the proposed USMD model achieves state-of-the-art performance.
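The four-module layout described in the abstract can be illustrated with a minimal routing sketch. This is not the authors' implementation: the act labels, the routing table, and every function name below are hypothetical, the three generators are stubs that return tagged strings, and a fixed lookup table stands in for the response classifier, whose actual design the abstract does not specify.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stubs for the three generators named in the abstract.
def general_response(utterance: str) -> str:
    return f"[general] reply to: {utterance}"

def knowledge_response(utterance: str) -> str:
    return f"[knowledge] facts for: {utterance}"

def image_recommendation(utterance: str) -> str:
    return f"[image] candidates for: {utterance}"

GENERATORS: Dict[str, Callable[[str], str]] = {
    "general": general_response,
    "knowledge": knowledge_response,
    "image": image_recommendation,
}

# Assumed routing table keyed by (user act, agent act); the labels are invented
# for illustration and are not taken from the paper.
ROUTES: Dict[Tuple[str, str], List[str]] = {
    ("ask_attribute", "inform"): ["knowledge"],
    ("show_similar", "recommend"): ["general", "image"],
}

def classify_response(user_act: str, agent_act: str) -> List[str]:
    """Pick generator names for this turn; fall back to the general generator."""
    return ROUTES.get((user_act, agent_act), ["general"])

def respond(utterance: str, user_act: str, agent_act: str) -> List[str]:
    """Run every generator selected by the classifier, in order."""
    return [GENERATORS[name](utterance) for name in classify_response(user_act, agent_act)]
```

Under these assumptions, a call such as respond("show me more like this", "show_similar", "recommend") would invoke the general generator and the image recommender in turn, while an unknown pair of acts falls back to the general generator alone.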
computer science, artificial intelligence
What problem does this paper attempt to address?