Reference-Aware Adaptive Network for Image-Text Matching

Guoxin Xiong, Meng, Tianzhu Zhang, Dongming Zhang, Yongdong Zhang
DOI: https://doi.org/10.1109/tcsvt.2024.3392619
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Image-text matching aims to bridge vision and language and is a crucial task in multi-modal intelligence. The core idea is to learn features for each modality and aggregate the learned features into holistic representations that are used to measure image-text relevance. Most existing methods introduce cross-modal interaction during feature learning, modeling fine-grained relationships between the two modalities to improve results. However, these methods may produce incorrect attention scores when similarities between regions and words are computed directly. In addition, current methods mainly rely on simple pooling operations for feature aggregation, which introduces interference from redundant information and leads to inaccurate matching results. To alleviate these issues, we propose a novel reference-aware adaptive network for image-text matching that jointly uses a reference attention module for feature learning and an adaptive aggregation module for feature aggregation. The proposed model has several merits. First, the reference attention module effectively reduces incorrect attention scores by introducing a set of references during cross-modal interaction. Second, the adaptive aggregation module adaptively highlights useful information while suppressing redundant information during aggregation. Extensive experiments on two standard benchmarks demonstrate that our method performs favorably against state-of-the-art methods.
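The contrast the abstract draws between simple pooling and weighted aggregation can be illustrated with a small sketch. This is not the paper's adaptive aggregation module, whose design the abstract does not specify. It is a generic softmax-weighted pooling, and the names `mean_pool`, `weighted_pool`, `scoring_vector`, and the feature sizes are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def mean_pool(features):
    """Simple pooling: every local feature contributes equally."""
    return features.mean(axis=0)


def weighted_pool(features, scores, temperature=1.0):
    """Softmax-weighted pooling: features with higher scores contribute more.

    Generic illustration only; not the module proposed in the paper.
    """
    logits = scores / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ features, weights


def cosine_similarity(a, b):
    """Relevance score between two holistic representations."""
    return float(l2_normalize(a) @ l2_normalize(b))


# Toy data: 5 region features and 4 word features, each 8-dimensional (assumed sizes).
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))
words = rng.normal(size=(4, 8))

# Hypothetical scoring vector standing in for learned parameters.
scoring_vector = rng.normal(size=8)

image_vec_mean = mean_pool(regions)
image_vec_weighted, region_weights = weighted_pool(regions, regions @ scoring_vector)
text_vec = mean_pool(words)

relevance = cosine_similarity(image_vec_weighted, text_vec)
```

When all scores are equal, `weighted_pool` reduces to `mean_pool`, so the weighted variant can be read as a generalization in which the scores decide how strongly each region or word influences the holistic representation.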