Visual Question Generation Under Multi-granularity Cross-Modal Interaction.

Zi Chai, Xiaojun Wan, Soyeon Caren Han, Josiah Poon
DOI: https://doi.org/10.1007/978-3-031-27077-2_20
2023-01-01
Abstract: Visual question generation (VQG) aims to automatically ask human-like questions about input images, targeting given answers. A key issue in VQG is performing effective cross-modal interaction, i.e., dynamically focusing on answer-related regions during question generation. In this paper, we propose a novel VQG framework based on multi-granularity cross-modal interaction, containing both object-level and relation-level interaction. For object-level interaction, we leverage both semantic and visual features under a contrastive learning scenario. We further illustrate the importance of high-level relations (e.g., spatial, semantic) between regions and answers for generating deeper questions. Since such information was somewhat ignored by prior VQG studies, we propose relation-level interaction based on graph neural networks. We perform experiments on the VQA2.0 and Visual7w datasets under both automatic and human evaluation, and our model outperforms all baseline models.