MaTCR: Modality-Aligned Thought Chain Reasoning for Multimodal Task-Oriented Dialogue Generation

Yiting Liu, Liang Li, Beichen Zhang, Shan Huang, Zheng-Jun Zha, Qingming Huang
DOI: https://doi.org/10.1145/3581783.3612268
2023-01-01
Abstract: In recent years, multimodal task-oriented dialogue systems have attracted increasing attention from the community, owing to their ability to provide user services naturally and efficiently. Despite their commercial value, these systems still face two challenges: (1) capturing users' intentions from lengthy context and side knowledge for question comprehension; and (2) jointly considering multimodal information for response generation. Previous methods, designed for specific scenarios, lack auxiliary reasoning structures with effective modality interaction, which hinders the comprehension of users' needs and impedes the generation of desired responses. To address these issues, we propose a Modality-aligned Thought Chain Reasoning (MaTCR) framework that inserts an explicit reasoning process into multimodal task-oriented dialogue generation. We construct a multimodal thought chain by summarizing intermediate user queries from aligned visual and textual context, which helps guide the comprehension of user intentions for generating reasonable responses. To effectively extract and integrate multimodal information for high-quality thought chain reasoning, we design a multimodal reasoner consisting of visual representation learning and modality-aligned fusion. We compare MaTCR against several strong baselines, including a currently highly regarded large language model. Extensive experiments on a benchmark dataset demonstrate that MaTCR outperforms existing methods and provides stronger interpretability.