Infer unseen from seen: Relation regularized zero-shot visual dialog

Zefan Zhang,Shun Li,Yi Ji,Chunping Liu
DOI: https://doi.org/10.1016/j.jvcir.2023.103961
IF: 2.887
2023-01-01
Journal of Visual Communication and Image Representation
Abstract:The Visual Dialog task requires retrieving the correct answer based on detected objects, a current question, and history dialogs. However, in real-world scenarios, most existing models face the hard-positive problem and are unable to reason about unseen features, which limits their generalization ability. To address this issue, we propose two Relation Regularized Modules (RRM) in this article. The first is the Visual Relation Regularized Module (VRRM), which seeks known visual features that have semantic relations with unknown visual features and leverages these known features to assist in understanding the unknown features. The second is the Text Relation Regularized Module (TRRM), which enhances the keywords in the answers to strengthen the understanding of unknown text features. To evaluate the effectiveness of these modules, we propose two zero-shot Visual Dialog splits for verification: Visual Zero-shot VisDial with unseen visual features and Text Zero-shot VisDial with unseen answers. Experimental results demonstrate that our proposed modules achieve state-of-the-art performance in zero-shot Visual Dialog with unseen visual features and unseen answers, while also producing comparable results on the benchmark VisDial v1.0 test dataset.
What problem does this paper attempt to address?