Knowledge Graph Enhanced Multimodal Transformer for Image-Text Retrieval
Juncheng Zheng, Meiyu Liang, Yang Yu, Yawen Li, Zhe Xue
DOI: https://doi.org/10.1109/icde60146.2024.00013
2024-01-01
Abstract: Image-text retrieval is a fundamental cross-modal task that aims to align the representation spaces of the image and text modalities. Existing cross-modal image-text retrieval methods independently generate embeddings for images and text, introduce interaction-based networks for cross-modal inference, and then perform retrieval using matching metrics. However, they overlook the semantic relationship between the coarse-grained and fine-grained representations within each modality and fail to capture the consistency of representations across modalities, which impairs the semantic learning of cross-modal representations and makes it difficult to align the modalities in semantic space. Consequently, these previous works inevitably suffer from low retrieval accuracy or high computational costs. In this paper, instead of directly fusing two heterogeneous cross-modal spaces, we propose a multimodal knowledge-enhanced multimodal transformer framework that combines coarse-grained and fine-grained representation learning in a unified framework, capturing alignment information between targets, constructing a global semantic graph, and ultimately aligning multimodal representations in the semantic space. In our approach, images generate semantic and spatial graphs to represent visual information, while sentences generate text graphs based on semantic relationships between words; both are used for intra-modal graph network inference. Subsequently, the generated global and local embeddings are fused into an enhanced multimodal transformer framework, which implements cross-modal interaction by leveraging prior implicit semantic information from the multimodal knowledge graph.
Furthermore, rather than simply matching words with image regions, our method uses a bidirectional fine-grained matching scheme that filters the salient regions of images and the salient words of texts, removes interfering noise, and pairs regions and words in both directions, which captures fine-grained bidirectional representational information and thus enables the model to generate more discriminative representations. Finally, equipped with a coarse-to-fine inference method based on hybrid global and local cross-modal similarities, we demonstrate through evaluation on two widely used datasets that the proposed method significantly outperforms existing state-of-the-art algorithms.
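The coarse-to-fine inference idea can be illustrated with a minimal sketch. The abstract does not give the scoring formulas, so everything specific here is an assumption made for illustration: cosine similarity as the matching metric, a max-then-mean pooling for the bidirectional region-word score, the mixing weight `alpha`, the shortlist size `top_k`, and the function names themselves. It is not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length so dot products act as cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def local_similarity(regions, words):
    """Assumed bidirectional fine-grained score for one image-text pair.

    regions: (R, d) region embeddings, words: (W, d) word embeddings.
    """
    sim = l2_normalize(regions) @ l2_normalize(words).T  # (R, W) cosine matrix
    region_to_word = sim.max(axis=1).mean()  # best word for each region
    word_to_region = sim.max(axis=0).mean()  # best region for each word
    return 0.5 * (region_to_word + word_to_region)

def coarse_to_fine_retrieve(img_global, img_regions, txt_globals, txt_words,
                            top_k=2, alpha=0.5):
    """Rank texts for one image query.

    Coarse stage: shortlist top_k texts by global cosine similarity.
    Fine stage: rerank the shortlist by a hybrid of global and local scores.
    alpha and top_k are hypothetical hyperparameters.
    """
    coarse = l2_normalize(txt_globals) @ l2_normalize(img_global)  # (N,)
    candidates = np.argsort(-coarse)[:top_k]
    hybrid = {
        int(i): alpha * coarse[i]
        + (1 - alpha) * local_similarity(img_regions, txt_words[i])
        for i in candidates
    }
    return sorted(hybrid, key=hybrid.get, reverse=True)
```

In this sketch the expensive region-word comparison runs only on the shortlisted candidates, which is one common way to trade a small amount of recall for lower inference cost.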