Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval

Xuri Ge, Fuhai Chen, Joemon M. Jose, Zhilong Ji, Zhongqin Wu, Xiao Liu

DOI: https://doi.org/10.1145/3474085.3475634

2021-08-05

Abstract: The current state-of-the-art image-sentence retrieval methods implicitly align the visual-textual fragments, like regions in images and words in sentences, and adopt attention modules to highlight the relevance of cross-modal semantic correspondences. However, the retrieval performance remains unsatisfactory due to a lack of consistent representation in both semantics and structural spaces. In this work, we propose to address the above issue from two aspects: (i) constructing intrinsic structure (along with relations) among the fragments of respective modalities, e.g., "dog $\to$ play $\to$ ball" in semantic structure for an image, and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities. In this paper, we propose a novel Structured Multi-modal Feature Embedding and Alignment (SMFEA) model for image-sentence retrieval. In order to jointly and explicitly learn the visual-textual embedding and the cross-modal alignment, SMFEA creates a novel multi-modal structured module with a shared context-aware referral tree. In particular, the relations of the visual and textual fragments are modeled by constructing Visual Context-aware Structured Tree encoder (VCS-Tree) and Textual Context-aware Structured Tree encoder (TCS-Tree) with shared labels, from which visual and textual features can be jointly learned and optimized. We utilize the multi-modal tree structure to explicitly align the heterogeneous image-sentence data by maximizing the semantic and structural similarity between corresponding inter-modal tree nodes. Extensive experiments on Microsoft COCO and Flickr30K benchmarks demonstrate the superiority of the proposed model in comparison to the state-of-the-art methods.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to align the semantic and structural relationships between the visual and textual modalities in image-sentence retrieval. Current state-of-the-art methods implicitly align image regions with sentence words and use attention modules to highlight the relevance of cross-modal semantic correspondences. Retrieval performance nevertheless remains unsatisfactory because the two modalities lack a consistent representation in both the semantic and structural spaces. The paper proposes the Structured Multi-modal Feature Embedding and Alignment (SMFEA) model, which tackles the issue from two aspects: (i) constructing the intrinsic structure (including relations) among the fragments of each modality, such as the semantic structure "dog → play → ball" for an image; and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities. To learn the visual-textual embedding and the cross-modal alignment jointly and explicitly, SMFEA introduces a multi-modal structured module with a shared context-aware referral tree. Specifically, a Visual Context-aware Structured Tree encoder (VCS-Tree) and a Textual Context-aware Structured Tree encoder (TCS-Tree) with shared labels model the relations among the visual and textual fragments, so that visual and textual features can be jointly learned and optimized. The multi-modal tree structure is then used to explicitly align heterogeneous image-sentence data by maximizing the semantic and structural similarity between corresponding inter-modal tree nodes. Experiments on the Microsoft COCO and Flickr30K benchmarks show that the proposed model outperforms state-of-the-art methods.
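
The idea of aligning corresponding tree nodes can be illustrated with a small sketch. This is not the authors' implementation: the text above does not give the SMFEA loss or encoder details, so the node labels (`subject`, `relation`, `object`), the use of cosine similarity, the simple averaging, and the toy two-dimensional embeddings are all assumptions made purely for illustration. It only shows how two trees that share node labels could be scored node by node.

```python
import math

# Hypothetical shared node labels for a three-node tree such as
# "dog -> play -> ball"; these names are illustrative assumptions,
# not identifiers taken from the paper.
NODE_LABELS = ("subject", "relation", "object")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def tree_alignment_score(visual_nodes, textual_nodes):
    """Mean cosine similarity over visual/textual nodes that share a label."""
    sims = [cosine(visual_nodes[label], textual_nodes[label])
            for label in NODE_LABELS]
    return sum(sims) / len(sims)


# Toy node embeddings (assumed values, chosen only to make the contrast visible).
visual_tree = {"subject": [1.0, 0.0], "relation": [0.0, 1.0], "object": [1.0, 1.0]}
matching_text = {"subject": [2.0, 0.0], "relation": [0.0, 3.0], "object": [0.5, 0.5]}
mismatched_text = {"subject": [0.0, 1.0], "relation": [1.0, 0.0], "object": [1.0, -1.0]}

matched_score = tree_alignment_score(visual_tree, matching_text)
mismatched_score = tree_alignment_score(visual_tree, mismatched_text)
print(matched_score, mismatched_score)
```

In this toy setup the matching sentence tree scores close to 1 and the mismatched one close to 0. A training objective in this spirit would push the score up for paired image-sentence trees and down for unpaired ones, though the exact objective used by SMFEA is not stated here.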

