Multi-step Self-attention Network for Cross-modal Retrieval Based on a Limited Text Space.

Zheng Yu, Wenmin Wang, Ge Li
DOI: https://doi.org/10.1109/icassp.2019.8682424
2019-01-01
Abstract: Cross-modal retrieval has recently been proposed to find an appropriate subspace in which the similarity among different modalities, such as image and text, can be directly measured. In this paper, we propose the Multi-step Self-Attention Network (MSAN) to perform cross-modal retrieval in a limited text space with multiple attention steps; it can selectively attend to partial shared information at each step and aggregate useful information over multiple steps to measure the final similarity. To achieve better retrieval results with faster training, we introduce global prior knowledge as global reference information. Extensive experiments on Flickr30K and MSCOCO show that MSAN achieves new state-of-the-art accuracy for cross-modal retrieval.