Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval

Xuri Ge, Fuhai Chen, Joemon M. Jose, Zhilong Ji, Zhongqin Wu, Xiao Liu
DOI: https://doi.org/10.1145/3474085.3475634
2021-08-05
Abstract: The current state-of-the-art image-sentence retrieval methods implicitly align the visual-textual fragments, like regions in images and words in sentences, and adopt attention modules to highlight the relevance of cross-modal semantic correspondences. However, the retrieval performance remains unsatisfactory due to a lack of consistent representation in both semantics and structural spaces. In this work, we propose to address the above issue from two aspects: (i) constructing intrinsic structure (along with relations) among the fragments of respective modalities, e.g., "dog $\to$ play $\to$ ball" in semantic structure for an image, and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities. In this paper, we propose a novel Structured Multi-modal Feature Embedding and Alignment (SMFEA) model for image-sentence retrieval. In order to jointly and explicitly learn the visual-textual embedding and the cross-modal alignment, SMFEA creates a novel multi-modal structured module with a shared context-aware referral tree. In particular, the relations of the visual and textual fragments are modeled by constructing Visual Context-aware Structured Tree encoder (VCS-Tree) and Textual Context-aware Structured Tree encoder (TCS-Tree) with shared labels, from which visual and textual features can be jointly learned and optimized. We utilize the multi-modal tree structure to explicitly align the heterogeneous image-sentence data by maximizing the semantic and structural similarity between corresponding inter-modal tree nodes. Extensive experiments on Microsoft COCO and Flickr30K benchmarks demonstrate the superiority of the proposed model in comparison to the state-of-the-art methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to align the semantic and structural relationships between the visual and textual modalities in image-sentence retrieval. Current state-of-the-art methods implicitly align regions in the image with words in the sentence and use attention modules to highlight the relevance of cross-modal semantic correspondences. However, retrieval performance remains unsatisfactory because the two modalities lack a consistent representation in both the semantic and structural spaces. The paper proposes the Structured Multi-modal Feature Embedding and Alignment (SMFEA) model, which tackles the problem from two aspects: (i) constructing the intrinsic structure (including relations) among the fragments of each modality, such as the semantic structure "dog → play → ball" for an image; and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities. Specifically, SMFEA jointly and explicitly learns the visual-textual embedding and the cross-modal alignment through a multi-modal structured module with a shared context-aware referral tree. The relations among visual and textual fragments are modeled by a Visual Context-aware Structured Tree encoder (VCS-Tree) and a Textual Context-aware Structured Tree encoder (TCS-Tree) with shared labels, so that visual and textual features can be jointly learned and optimized. The multi-modal tree structure is then used to explicitly align heterogeneous image-sentence data by maximizing the semantic and structural similarity between corresponding inter-modal tree nodes. Experiments on the Microsoft COCO and Flickr30K benchmarks are reported to show that the proposed model outperforms state-of-the-art methods.
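To make the idea of node-level alignment concrete, the following is a minimal sketch of scoring an image-sentence pair by comparing embeddings at tree nodes that carry the same label. It is an illustration under stated assumptions, not the SMFEA implementation: the abstract does not specify the similarity function or the loss, so cosine similarity, the mean over shared nodes, the hinge ranking loss, the margin value, the node labels (`subject`, `relation`, `object`, loosely following "dog → play → ball"), and all function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def tree_alignment_score(visual_nodes, textual_nodes):
    """Mean cosine similarity over the node labels present in both trees.

    Each argument maps a node label to that node's embedding vector.
    Assumption: averaging per-node cosine similarity stands in for the
    paper's (unspecified here) semantic and structural similarity.
    """
    shared = sorted(set(visual_nodes) & set(textual_nodes))
    if not shared:
        return 0.0
    total = sum(cosine(visual_nodes[k], textual_nodes[k]) for k in shared)
    return total / len(shared)


def hinge_ranking_loss(pos_score, neg_score, margin=0.2):
    """Assumed training signal: a matched pair should beat a mismatched pair by `margin`."""
    return max(0.0, margin - pos_score + neg_score)


# Toy 2-d node embeddings (hypothetical values chosen for illustration only).
image_tree = {"subject": [1.0, 0.0], "relation": [0.0, 1.0], "object": [1.0, 1.0]}
matching_sentence = {"subject": [2.0, 0.0], "relation": [0.0, 3.0], "object": [0.5, 0.5]}
mismatched_sentence = {"subject": [0.0, 1.0], "relation": [1.0, 0.0], "object": [1.0, -1.0]}

pos = tree_alignment_score(image_tree, matching_sentence)
neg = tree_alignment_score(image_tree, mismatched_sentence)
loss = hinge_ranking_loss(pos, neg)
```

In this toy setup the matching sentence has node embeddings parallel to the image's, so it scores higher than the mismatched sentence, whose node embeddings are orthogonal. Retrieval would then rank candidate sentences for an image (or images for a sentence) by this score.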