Multimodal Fusion of Visual Dialog: A Survey

Xiaofan Chen, Songyang Lao, Ting Duan
DOI: https://doi.org/10.1145/3438872.3439098
2020-01-01
Abstract: Visual Dialog, which aims to hold a meaningful conversation with humans about natural images, is a 'high-level' AI task of multimodal fusion. Since the Visual Dialog challenge was proposed in 2017, multimodal fusion has made significant breakthroughs with the help of deep learning techniques. The goal of this paper is to provide a comprehensive survey of recent achievements in the Visual Dialog task. The survey covers several aspects of multimodal fusion research: Visual Co-reference Resolution, Attention Mechanisms, and Graph Neural Networks, as well as evaluation issues, specifically benchmark datasets, evaluation metrics, and state-of-the-art performance.