Multi-Modal Memory Enhancement Attention Network for Image-Text Matching

Zhong Ji, Zhigang Lin, Haoran Wang, Yuqing He
DOI: https://doi.org/10.1109/access.2020.2975594
IF: 3.9
2020-01-01
IEEE Access
Abstract: Image-text matching is an attractive research topic in the vision and language community. The key to narrowing the “heterogeneity gap” between visual and textual data lies in learning powerful and robust representations for both modalities. This paper addresses this issue and pursues fine-grained visual-textual alignment from two aspects: exploiting an attention mechanism to locate the semantically meaningful portions and leveraging a memory network to capture long-term contextual knowledge. Unlike most existing studies, which focus solely on exploring cross-modal associations at the fragment level, our Collaborative Dual Attention (CDA) module models semantic interdependencies from both the fragment and channel perspectives. Furthermore, since long-term contextual knowledge helps compensate for detailed semantics concealed in rarely appearing image-text pairs, we propose to learn joint representations by constructing a Multi-Modal Memory Enhancement (M3E) module. Specifically, it sequentially stores intra-modal and multi-modal information in the memory items, which in turn persistently memorize cross-modal shared semantics to improve the latent embeddings. By incorporating both the CDA and M3E modules into a deep architecture, our approach generates more semantically consistent embeddings for representing images and texts. Extensive experiments demonstrate that our model achieves state-of-the-art results on two public benchmark datasets.