You should know more: Learning external knowledge for visual dialog

Lei Zhao, Haonan Zhang, Xiangpeng Li, Sen Yang, Yuanfeng Song
DOI: https://doi.org/10.1016/j.neucom.2021.10.121
IF: 6
2022-06-01
Neurocomputing
Abstract: Visual dialog is a task in which two agents carry out a multi-round conversation grounded in an image, a caption, and the dialog history. Despite recent progress, the performance of existing methods still degrades in complex scenarios. Handling these scenarios depends on logical reasoning that requires commonsense priors. In this paper, we propose a novel visual dialog pipeline named Structured Knowledge-Aware Network (SKANet), consisting of an Image Knowledge-Aware Module and a Caption Knowledge-Aware Module. Specifically, the Image and Caption Knowledge-Aware Modules construct commonsense knowledge graphs from ConceptNet. We apply SKANet to two sub-tasks: conventional visual dialog and a goal-oriented visual dialog named ‘image guessing’. For conventional visual dialog, SKANet is combined with an additional Multi-Modality Fusion Module, which is designed to explore the visual content and the textual context of the dialog history. For goal-oriented visual dialog, we directly apply the Image and Caption Knowledge-Aware Modules to the two agents, respectively. Experimental results on the VisDial v0.9 and VisDial v1.0 datasets show that our proposed method outperforms comparative methods on both sub-tasks.
computer science, artificial intelligence