Abstract: Current state-of-the-art image-sentence retrieval methods implicitly align visual-textual fragments, such as regions in images and words in sentences, and adopt attention modules to highlight the relevance of cross-modal semantic correspondences. However, retrieval performance remains unsatisfactory due to the lack of a consistent representation in both the semantic and structural spaces. We address this issue from two aspects: (i) constructing the intrinsic structure (along with relations) among the fragments of each modality, e.g., "dog $\to$ play $\to$ ball" as the semantic structure of an image, and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities. To this end, we propose a novel Structured Multi-modal Feature Embedding and Alignment (SMFEA) model for image-sentence retrieval. To jointly and explicitly learn the visual-textual embedding and the cross-modal alignment, SMFEA introduces a multi-modal structured module with a shared context-aware referral tree. In particular, the relations among visual and textual fragments are modeled by a Visual Context-aware Structured Tree encoder (VCS-Tree) and a Textual Context-aware Structured Tree encoder (TCS-Tree) with shared labels, from which visual and textual features can be jointly learned and optimized. We use the multi-modal tree structure to explicitly align the heterogeneous image-sentence data by maximizing the semantic and structural similarity between corresponding inter-modal tree nodes. Extensive experiments on the Microsoft COCO and Flickr30K benchmarks demonstrate the superiority of the proposed model over state-of-the-art methods.
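
The idea of aligning corresponding inter-modal tree nodes can be illustrated with a minimal sketch. It is not the SMFEA implementation: the node labels (`subject`, `relation`, `object`, mirroring "dog $\to$ play $\to$ ball"), the use of cosine similarity, the averaging over nodes, and the hinge-style ranking loss with its `margin` value are all assumptions made for illustration, and the function names are hypothetical.

```python
import math

# Hypothetical shared node labels, mirroring "dog -> play -> ball".
NODE_LABELS = ("subject", "relation", "object")


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def tree_similarity(visual_nodes, textual_nodes):
    """Average node-wise similarity between two trees with shared labels.

    Each argument maps a node label to that node's feature vector.
    Averaging is an assumed aggregation, chosen for simplicity.
    """
    sims = [cosine(visual_nodes[label], textual_nodes[label])
            for label in NODE_LABELS]
    return sum(sims) / len(sims)


def hinge_alignment_loss(pos_sim, neg_sim, margin=0.2):
    """Assumed ranking loss: push a matching pair above a mismatched one."""
    return max(0.0, margin - pos_sim + neg_sim)


# Toy features: one image tree, a matching and a mismatched sentence tree.
image_tree = {"subject": [1.0, 0.0], "relation": [0.0, 1.0], "object": [1.0, 1.0]}
matching_sentence = {"subject": [1.0, 0.0], "relation": [0.0, 1.0], "object": [1.0, 1.0]}
mismatched_sentence = {"subject": [0.0, 1.0], "relation": [1.0, 0.0], "object": [1.0, 1.0]}

pos = tree_similarity(image_tree, matching_sentence)
neg = tree_similarity(image_tree, mismatched_sentence)
loss = hinge_alignment_loss(pos, neg)
```

Under these assumptions, minimizing the loss over many image-sentence pairs raises the node-wise similarity of matching trees relative to mismatched ones, which is one simple way to read "maximizing the semantic and structural similarity between corresponding inter-modal tree nodes".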

Multimodal Deep Embedding via Hierarchical Grounded Compositional Semantics

Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models

Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations

Learning semantic sentence representations from visually grounded language without lexical knowledge

Multimodal Composition Example Mining for Composed Query Image Retrieval

Semantic Composition in Visually Grounded Language Models

Learning Unseen Concepts Via Hierarchical Decomposition and Composition

Multimodal Relation Extraction via a Mixture of Hierarchical Visual Context Learners

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Towards Semantic Embedding In Visual Vocabulary

Bridging Continuous and Discrete Spaces: Interpretable Sentence Representation Learning via Compositional Operations

Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval

Deep Multi-Graph Hierarchical Enhanced Semantic Representation for Cross-Modal Retrieval

Learning Multi-Modal Word Representation Grounded in Visual Context

Structural Embedding of Syntactic Trees for Machine Comprehension

Semantic Compositional Networks for Visual Captioning

Multimodality-guided Visual-Caption Semantic Enhancement

Universal Multimodal Representation for Language Understanding

Composition Vision-Language Understanding via Segment and Depth Anything Model

Multimodal Sentiment Analysis Based on Composite Hierarchical Fusion

Model Composition for Multimodal Large Language Models