Comprehensive Framework of Early and Late Fusion for Image-Sentence Retrieval

Yifan Wang, Xing Xu, Wei Yu, Ruicong Xu, Zuo Cao, Heng Tao Shen
DOI: https://doi.org/10.1109/MMUL.2022.3144972
IF: 3.4911
2022-01-01
IEEE Multimedia
Abstract: Image-text retrieval is a challenging task that requires bridging the modality gap between vision and language. Although mainstream late fusion schemes can facilitate intramodality correlations, they impose a heavy computational burden and yield insufficient intermodal alignment. In this work, we propose the comprehensive framework of early and late fusion (CFELF), a universal framework that combines early fusion with late fusion. To enhance cross-modal correspondence, CFELF fuses local visual regions with global sentences at the early stage and aggregates the result on late fusion backbones. The fusions at the two stages of feature processing thus complement each other, capturing salient intramodality information while encouraging intermodal alignment. We extensively evaluate CFELF on four advanced late fusion backbones and compare it with other early fusion modules. Results on two public image-text datasets demonstrate the effectiveness of the comprehensive fusion framework in retrieval performance, along with accelerated convergence.
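To make the distinction between the two fusion stages concrete, the following is a minimal sketch and not the CFELF architecture itself. It assumes that "early fusion" can be illustrated by sentence-guided attention over region features, and that "late fusion" can be illustrated by pooling each modality independently before a cosine comparison. The function names (`early_fuse`, `late_score`, `combined_score`) and the mixing weight `alpha` are hypothetical and do not come from the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def early_fuse(regions, sentence):
    """Illustrative early fusion (assumption): the global sentence vector
    attends over local region vectors, giving a sentence-aware image vector."""
    weights = softmax([dot(r, sentence) for r in regions])
    dim = len(sentence)
    return [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]

def late_score(regions, sentence):
    """Illustrative late fusion (assumption): regions are mean-pooled without
    looking at the sentence, and the modalities meet only at the similarity."""
    dim = len(sentence)
    pooled = [sum(r[i] for r in regions) / len(regions) for i in range(dim)]
    return cosine(pooled, sentence)

def combined_score(regions, sentence, alpha=0.5):
    """Hypothetical combination: a weighted mix of both similarity scores."""
    early = cosine(early_fuse(regions, sentence), sentence)
    return alpha * early + (1 - alpha) * late_score(regions, sentence)

# Toy example: two 2-D region features and one sentence feature.
regions = [[1.0, 0.0], [0.0, 1.0]]
sentence = [1.0, 0.0]
print(late_score(regions, sentence))      # similarity without cross-modal interaction
print(combined_score(regions, sentence))  # similarity with the early-fusion term mixed in
```

In this toy setting the early-fusion term puts more weight on the region that matches the sentence, so the combined score is higher than the late-fusion score alone. How the two stages are actually combined in CFELF is not specified in the abstract.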