Abstract: Current state-of-the-art image-sentence retrieval methods implicitly align visual-textual fragments, such as regions in images and words in sentences, and adopt attention modules to highlight the relevance of cross-modal semantic correspondences. However, retrieval performance remains unsatisfactory because the representations are not consistent across the semantic and structural spaces. In this work, we address this issue from two aspects: (i) constructing the intrinsic structure (along with relations) among the fragments of each modality, e.g., "dog $\to$ play $\to$ ball" in the semantic structure for an image, and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities. To this end, we propose a novel Structured Multi-modal Feature Embedding and Alignment (SMFEA) model for image-sentence retrieval. To jointly and explicitly learn the visual-textual embedding and the cross-modal alignment, SMFEA introduces a multi-modal structured module with a shared context-aware referral tree. In particular, the relations among the visual and textual fragments are modeled by constructing a Visual Context-aware Structured Tree encoder (VCS-Tree) and a Textual Context-aware Structured Tree encoder (TCS-Tree) with shared labels, from which visual and textual features can be jointly learned and optimized. We use the multi-modal tree structure to explicitly align the heterogeneous image-sentence data by maximizing the semantic and structural similarity between corresponding inter-modal tree nodes. Extensive experiments on the Microsoft COCO and Flickr30K benchmarks demonstrate the superiority of the proposed model over state-of-the-art methods.
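The idea of comparing corresponding nodes of a visual tree and a textual tree that share labels can be illustrated with a minimal sketch. It is not the SMFEA implementation: the node labels (`subject`, `relation`, `object`), the toy two-dimensional embeddings, and the function `tree_alignment_score` are all assumptions made for illustration, and mean cosine similarity stands in for whatever similarity the model actually optimizes.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def tree_alignment_score(visual_nodes, textual_nodes):
    """Mean cosine similarity over the node labels present in both trees.

    Each tree is represented (hypothetically) as a dict mapping a shared
    label to that node's embedding vector.
    """
    shared = sorted(set(visual_nodes) & set(textual_nodes))
    if not shared:
        return 0.0
    total = sum(cosine(visual_nodes[k], textual_nodes[k]) for k in shared)
    return total / len(shared)


# Toy embeddings for the "dog -> play -> ball" structure (made-up values).
visual = {"subject": [1.0, 0.0], "relation": [0.0, 1.0], "object": [1.0, 1.0]}
matching_text = {"subject": [2.0, 0.0], "relation": [0.0, 3.0], "object": [0.5, 0.5]}
mismatched_text = {"subject": [0.0, 1.0], "relation": [1.0, 0.0], "object": [1.0, -1.0]}

match_score = tree_alignment_score(visual, matching_text)
mismatch_score = tree_alignment_score(visual, mismatched_text)
```

In this toy setup the matching sentence scores higher than the mismatched one, which is the property a retrieval objective built on node-level alignment would try to enforce during training.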

Multi-Scale Fine-Grained Alignments for Image and Sentence Matching

Multi‐scale Cross‐domain Alignment for Person Image Generation

A Fine-Grained Semantic Alignment Method Specific to Aggregate Multi-Scale Information for Cross-Modal Remote Sensing Image Retrieval

Multi-Modality Cross Attention Network for Image and Sentence Matching

Multi-scale Matching Networks for Semantic Correspondence

Decoupled Cross-Modal Phrase-Attention Network for Image-Sentence Matching

A New Fine-grained Alignment Method for Image-text Matching

Cross-Modal Attention With Semantic Consistence for Image–Text Matching

Graph Structured Network for Image-Text Matching

HybridVocab: Towards Multi-Modal Machine Translation Via Multi-Aspect Alignment

Multi-level network based on transformer encoder for fine-grained image–text matching

Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval

Matching Image and Sentence with Multi-Faceted Representations

Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval

Chinese Semantic Matching with Multi-granularity Alignment and Feature Fusion

Multi-granularity Correlation Refinement for Semantic Correspondence

Frame-based Multi-level Semantics Representation for text matching

Advanced Multimodal Deep Learning Architecture for Image-Text Matching

Similarity Reasoning and Filtration for Image-Text Matching