Abstract: Image-text retrieval is a fundamental cross-modal task that aims to align the representation spaces of the image and text modalities. Existing cross-modal image-text retrieval methods independently generate embeddings for images and text, introduce interaction-based networks for cross-modal inference, and then perform retrieval using matching metrics. However, they overlook the semantic relationship between the coarse-grained and fine-grained representations within each modality and thus fail to capture the consistency of representations across modalities. This hampers the semantic learning of cross-modal representations and makes it difficult to align the modalities in semantic space. Consequently, these previous works suffer from low retrieval accuracy or high computational cost. In this paper, instead of directly fusing two heterogeneous cross-modal spaces, we propose a multimodal knowledge-enhanced multimodal transformer network that combines coarse-grained and fine-grained representation learning in a unified framework, capturing alignment information between targets, constructing a global semantic graph, and ultimately aligning multimodal representations in the semantic space. In our approach, images generate semantic and spatial graphs to represent visual information, while sentences generate text graphs based on semantic relationships between words; both kinds of graph are used for intra-modal graph network inference. Subsequently, the generated global and local embeddings are fused into an enhanced multimodal transformer framework, which implements cross-modal interaction by leveraging prior implicit semantic information from the multimodal knowledge graph.
Furthermore, rather than simply matching words with image regions, we propose a bidirectional fine-grained matching method that filters the salient regions and words of images and texts, removes interfering noise, and performs bidirectional fine-grained pairing. This captures fine-grained bidirectional representational information and thus enables the model to generate more discriminative representations. Finally, equipped with a coarse-to-fine inference method based on hybrid global and local cross-modal similarities, the proposed method significantly outperforms existing state-of-the-art algorithms on two widely used datasets.
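
To make the coarse-to-fine inference idea concrete, the following is a minimal illustrative sketch and not the implementation described in the paper. It assumes that each item carries one global embedding and a list of local (region or word) embeddings, that cosine similarity is the matching metric, that the local score averages best matches in both directions, and that the hybrid score is a weighted sum. The names `coarse_to_fine_retrieve`, `local_similarity`, `top_k`, and `alpha` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def local_similarity(regions, words):
    """Assumed bidirectional fine-grained score.

    Averages the best word match per region and the best region match per word.
    """
    r2w = sum(max(cosine(r, w) for w in words) for r in regions) / len(regions)
    w2r = sum(max(cosine(w, r) for r in regions) for w in words) / len(words)
    return 0.5 * (r2w + w2r)


def coarse_to_fine_retrieve(query, candidates, top_k=2, alpha=0.5):
    """Return candidate indices, best first.

    query and each candidate are dicts with keys "global" (one vector)
    and "local" (a list of vectors). alpha weights the global score.
    """
    def global_score(i):
        return cosine(query["global"], candidates[i]["global"])

    # Coarse stage: keep only the top_k candidates by global similarity.
    shortlist = sorted(range(len(candidates)), key=global_score, reverse=True)[:top_k]

    # Fine stage: rerank the shortlist with the hybrid global + local score.
    def hybrid_score(i):
        local = local_similarity(candidates[i]["local"], query["local"])
        return alpha * global_score(i) + (1 - alpha) * local

    return sorted(shortlist, key=hybrid_score, reverse=True)


# Toy data: a text query with two word vectors and three candidate images.
query = {"global": [1.0, 0.0], "local": [[1.0, 0.0], [0.0, 1.0]]}
candidates = [
    {"global": [1.0, 0.0], "local": [[1.0, 0.0]]},
    {"global": [0.9, 0.1], "local": [[1.0, 0.0], [0.0, 1.0]]},
    {"global": [0.0, 1.0], "local": [[0.0, 1.0]]},
]
ranking = coarse_to_fine_retrieve(query, candidates, top_k=2)
```

In this toy example, candidate 0 has the highest global similarity, but candidate 1 covers both query words and therefore moves ahead of it after the fine stage. Candidate 2 is discarded in the coarse stage and never incurs the more expensive local comparison, which is the cost saving a coarse-to-fine scheme is meant to provide.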

Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion

Knowledge Representation Learning with Entity Descriptions, Hierarchical Types, and Textual Relations

Multi-hop neighbor fusion enhanced hierarchical transformer for multi-modal knowledge graph completion

MM-Transformer: A Transformer-Based Knowledge Graph Link Prediction Model That Fuses Multimodal Features

TE-TFN: A Text-enhanced Transformer Fusion Network for Multimodal Knowledge Graph Completion

MLSFF: Multi-level structural features fusion for multi-modal knowledge graph completion

MERGE: A Modal Equilibrium Relational Graph Framework for Multi-Modal Knowledge Graph Completion

Structure Guided Multi-modal Pre-trained Transformer for Knowledge Graph Reasoning

Relation Extraction with Knowledge-Enhanced Prompt-Tuning on Multimodal Knowledge Graph

Simple Yet Effective: Structure Guided Pre-trained Transformer for Multi-modal Knowledge Graph Reasoning

Structure Pre-training and Prompt Tuning for Knowledge Graph Transfer

MTKGCformer: A Multi-train Transformer-based Representation Learning for Knowledge Graph Completion Task

MMKGR: Multi-hop Multi-modal Knowledge Graph Reasoning

Knowledge Graph Completion Via Multi-feature Learning

The Power of Noise: Toward a Unified Multi-modal Knowledge Graph Representation Framework

Knowledge Graph Enhanced Multimodal Transformer for Image-Text Retrieval

mmFormer: Multimodal Medical Transformer for Incomplete Multimodal Learning of Brain Tumor Segmentation

Knowledge Graph Completion with Pre-trained Multimodal Transformer and Twins Negative Sampling

Multi-Modal Siamese Network for Few-Shot Knowledge Graph Completion

Unleashing the Power of Imbalanced Modality Information for Multi-modal Knowledge Graph Completion

HKA: A Hierarchical Knowledge Alignment Framework for Multimodal Knowledge Graph Completion