Multi-view inter-modality representation with progressive fusion for image-text matching

Jie Wu, Leiquan Wang, Chenglizhao Chen, Jing Lu, Chunlei Wu
DOI: https://doi.org/10.1016/j.neucom.2023.02.043
IF: 6
2023-01-01
Neurocomputing
Abstract: Recently, image-text matching has been intensively explored to bridge vision and language. Previous methods explore the inter-modality relationship of an image-text pair from a single-view feature. However, it is difficult to discover all the abundant information based on a single inter-modality relationship. In this paper, a novel Multi-View Inter-Modality Representation with Progressive Fusion (MIRPF) is developed to explore inter-modality relationships from multi-view features. The multi-view strategy provides more complementary and global semantic clues than single-view approaches. In particular, a multi-view inter-modality representation network is constructed to generate multiple inter-modality representations, which provide diverse views for discovering the latent image-text relationships. Furthermore, a progressive fusion module fuses the inter-modality features stepwise, fully exploiting the inherent complementarity between different views. Extensive experiments on Flickr30K and MSCOCO verify the superiority of MIRPF over several existing approaches. The code is available at: https://github.com/jasscia18/MIRPF. © 2023 Published by Elsevier B.V.