Abstract: Image-text retrieval is a fundamental cross-modal task that aims to align the representation spaces of the image and text modalities. Existing cross-modal image-text retrieval methods independently generate embeddings for images and text, introduce interaction-based networks for cross-modal inference, and then perform retrieval using matching metrics. However, they overlook the semantic relationship between the coarse-grained and fine-grained representations within each modality and fail to capture the consistency of representations across modalities, which hinders the semantic learning of cross-modal representations and makes it difficult to align the modalities in semantic space. Consequently, these previous works inevitably suffer from low retrieval accuracy or high computational costs. In this paper, instead of directly fusing two heterogeneous cross-modal spaces, we propose a multimodal knowledge-enhanced multimodal transformer network that combines coarse-grained and fine-grained representation learning in a unified framework, capturing alignment information between targets, constructing a global semantic graph, and ultimately aligning multimodal representations in the semantic space. In our approach, images generate semantic and spatial graphs to represent visual information, while sentences generate text graphs based on the semantic relationships between words; both are used for intra-modal graph network inference. Subsequently, the generated global and local embeddings are fused in an enhanced multimodal transformer framework, which implements cross-modal interaction effectively by leveraging prior implicit semantic information from the multimodal knowledge graph.
Furthermore, rather than simply matching words with image regions, we propose a bidirectional fine-grained matching method that filters the salient regions and words of images and texts, removes interfering noise, and pairs regions and words in both directions. This captures fine-grained bidirectional representational information and thus enables the model to generate more discriminative representations. Finally, equipped with a coarse-to-fine inference method based on hybrid global and local cross-modal similarities, we demonstrate on two widely used datasets that the proposed method significantly outperforms existing state-of-the-art algorithms.
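
To make the coarse-to-fine inference idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each image and sentence already has one global embedding and a set of local (region or word) embeddings; the function names, the max-then-mean pooling used for the bidirectional local score, and the `top_k` and `alpha` parameters are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bidirectional_local_sim(regions, words):
    """Assumed fine-grained score: each region is paired with its best word,
    each word with its best region, and the two directions are averaged."""
    region_to_word = sum(max(cosine(r, w) for w in words) for r in regions) / len(regions)
    word_to_region = sum(max(cosine(w, r) for r in regions) for w in words) / len(words)
    return 0.5 * (region_to_word + word_to_region)


def coarse_to_fine_rank(query, gallery, top_k=2, alpha=0.5):
    """Shortlist by global similarity, then rerank the shortlist by a
    hybrid of global and local similarity (alpha weights the global term)."""
    def global_sim(item):
        return cosine(query["global"], item["global"])

    # Coarse stage: cheap global comparison against every candidate.
    shortlist = sorted(gallery, key=global_sim, reverse=True)[:top_k]

    # Fine stage: the costlier local matching runs only on the shortlist.
    def hybrid_sim(item):
        local = bidirectional_local_sim(item["local"], query["local"])
        return alpha * global_sim(item) + (1 - alpha) * local

    return [item["id"] for item in sorted(shortlist, key=hybrid_sim, reverse=True)]


# Toy 2-D embeddings (made up for illustration): a text query and three images.
query = {"global": [1.0, 0.0], "local": [[1.0, 0.0], [0.0, 1.0]]}
gallery = [
    {"id": "A", "global": [1.0, 0.1], "local": [[1.0, 0.0]]},
    {"id": "B", "global": [0.9, 0.3], "local": [[1.0, 0.0], [0.0, 1.0]]},
    {"id": "C", "global": [0.0, 1.0], "local": [[0.0, 1.0]]},
]
ranking = coarse_to_fine_rank(query, gallery)
print(ranking)
```

In this toy case image A has the highest global similarity, but image B covers both query words at the local level, so the hybrid score moves B ahead of A after reranking. Restricting the local matching to the shortlist is what keeps the fine stage inexpensive.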

Knowledge Graph Enhanced Multimodal Transformer for Image-Text Retrieval
