Fine-Grained Cross-Modal Retrieval with Triple-Streamed Memory Fusion Transformer Encoder

Weikuo Guo, Huaibo Huang, Xiangwei Kong, Ran He
DOI: https://doi.org/10.1109/ICME52920.2022.9859738
2022-01-01
Abstract: Recently, the powerful attention mechanism has been widely used to learn fine-grained cross-modal correspondences. However, existing attention-based methods often struggle with the trade-off between effectiveness and efficiency. To address this deficiency, we propose a novel Triple-streamed architecture with a newly designed Memory fusion Transformer Encoder (Tri-MTE) for fine-grained cross-modal retrieval. Specifically, the whole model retains the “late fusion” strategy, thus ensuring efficiency. To strengthen the inter-modality interaction and improve effectiveness, a memory fusion stream is designed and inserted between the modality streams to remember the modality-irrelevant information. Encoding such information into the modality representations would significantly enhance cross-modal retrieval performance. Finally, a bionic memory activation constraint is proposed to aid the learning procedure. Extensive experiments on two benchmark datasets show that the proposed method achieves promising results.