Feature First: Advancing Image-Text Retrieval Through Improved Visual Features

Dongqing Wu,Huihui Li,Cang Gu,Hang Liu,Cuili Xu,Yinxuan Hou,Lei Guo
DOI: https://doi.org/10.1109/tmm.2023.3316077
IF: 7.3
2023-01-01
IEEE Transactions on Multimedia
Abstract:Current image-text retrieval methods mainly utilize region features that provide object-level information to represent images, making the retrieval results more accurate and interpretable. However, there are several issues with region features, such as lack of rich contextual information, loss of object details and risk of detection redundancy. The ideal visual features in image-text retrieval should have three characteristics: object-level, semantically-rich, and language-aligned. To this end, we propose a novel visual representation framework to capture more comprehensive and powerful visual features. Specifically, since these region feature disadvantages are the grid feature advantages, we first build a two-step interaction model to explore the complex relationship between them from the spatial and semantic perspectives to integrate their complementary information, making the fused visual features both object-level and semantic-rich. Then, we design a text-integrated visual embedding module that utilizes textual information as guidance to filter redundant regions, further endowing visual features with language-aligned capabilities. Finally, we develop a multi-attention pooling module to better aggregate these enhanced visual features in a more fine-grained manner. Extensive experiments demonstrate that our proposed model achieves state-of-the-art performance on the benchmark datasets Flickr30K and MS-COCO.
What problem does this paper attempt to address?