Self-attention Guided Representation Learning for Image-Text Matching.

Xuefei Qi, Ying Zhang, Jinqing Qi, Huchuan Lu
DOI: https://doi.org/10.1016/j.neucom.2021.03.129
IF: 6
2021-01-01
Neurocomputing
Abstract: Image-text matching plays an important role in bridging vision and language. Most existing work embeds both images and sentences into a joint latent space to measure their similarity. However, these methods fail either to exploit the interactions between sentences and images or to distinguish the importance of different elements within each modality. In this paper, we propose a self-attention guided representation (SGR) learning model, which incorporates the guidance of a self-attention mechanism into a cross-attention representation learning module for image-text matching. Specifically, we introduce a self-attention mechanism to discriminate the importance of different words within a sentence, as well as that of different regions within an image. The representations associated with each modality are then fed into a cross-attention module to discover distinct alignments between words and regions. Moreover, we employ Term Frequency-Inverse Document Frequency (TF-IDF) to highlight the words that function prominently in textual descriptions, paving the way to accurate image-text matching results. Extensive experiments on the MSCOCO and Flickr30K datasets demonstrate the effectiveness and superiority of our framework.
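As an illustration of the TF-IDF word weighting mentioned in the abstract, the following is a minimal sketch only. The abstract does not state which TF-IDF variant the authors use or how the weights enter the model, so the formula here (term frequency times log(N / document frequency)), the function name `tfidf_weights`, and the toy captions are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_weights(captions):
    """Return one {word: weight} dict per caption, using tf * log(N / df).

    Hypothetical helper: the weighting variant is an assumption,
    not the formulation used in the paper.
    """
    docs = [caption.lower().split() for caption in captions]
    n_docs = len(docs)
    # Document frequency: number of captions that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights


# Toy captions, invented for this example.
captions = [
    "a dog runs on the grass",
    "a cat sleeps on the sofa",
    "a dog chases a cat",
]
weights = tfidf_weights(captions)
```

In this toy corpus, a word that appears in every caption (such as "a") receives weight zero, while a word unique to one caption (such as "grass") receives a higher weight than one shared across captions (such as "dog"), which is the sense in which TF-IDF highlights prominent words.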