Joint Intra & Inter-Grained Reasoning: A New Look into Semantic Consistency of Image-Text Retrieval

Renjie Pan, Hua Yang, Cunyan Li, Jinhai Yang
DOI: https://doi.org/10.1109/tmm.2023.3327645
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: Multimodal understanding aims at constructing semantic correlations among modalities of data while performing various downstream tasks. As one of the primary multimodal downstream tasks, image-text retrieval imposes a high demand on semantic alignment because of the independent expression paradigms of images and text. Existing methods mainly construct a joint embedding space at a single granularity level (either global or local). However, such single reasoning paradigms lack granularity interaction, resulting in semantic inconsistency and cross-domain catastrophes. To address these issues, we design a novel Joint Intra and Inter-grained Network (JIIGNet), focusing on not only intra- but also inter-grained interaction between modalities by combining scene information (global) with region-level (local) instances. Specifically, we simultaneously initiate three specific alignment modules, i.e., global-grained, local-grained, and cross-grained alignment modules, followed by Triplet Attention Refinement to better refine the fused embedding at the alignment level with proper self- and cross-attention. For different scenarios, a Style Adaptation Head is further designed to smartly accommodate different samples. We validate JIIGNet through extensive experiments conducted on two widely used datasets, Flickr30K and MS-COCO, demonstrating the effectiveness of our proposed method.
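As a rough illustration of what global-grained, local-grained, and cross-grained alignment could mean, the sketch below scores one image-text pair at each granularity using plain cosine similarity. It is not the authors' implementation: the function names, the max-over-regions matching, and the unweighted averaging are all assumptions made for illustration, and the abstract's Triplet Attention Refinement and Style Adaptation Head are not modelled.

```python
import math

# Hypothetical sketch: all names and scoring rules below are assumptions,
# not taken from the JIIGNet paper.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def global_sim(img_global, txt_global):
    """Global-grained: whole-image embedding vs. whole-sentence embedding."""
    return cosine(img_global, txt_global)

def local_sim(regions, words):
    """Local-grained: match each word to its best region, then average."""
    return sum(max(cosine(w, r) for r in regions) for w in words) / len(words)

def cross_sim(img_global, regions, txt_global, words):
    """Cross-grained: global embedding of one modality vs. local
    embeddings of the other, averaged over both directions."""
    img_to_words = sum(cosine(img_global, w) for w in words) / len(words)
    txt_to_regions = sum(cosine(txt_global, r) for r in regions) / len(regions)
    return 0.5 * (img_to_words + txt_to_regions)

def fused_sim(img_global, regions, txt_global, words):
    """Unweighted mean of the three scores (a stand-in for learned fusion)."""
    return (
        global_sim(img_global, txt_global)
        + local_sim(regions, words)
        + cross_sim(img_global, regions, txt_global, words)
    ) / 3.0
```

In a retrieval setting, a score like `fused_sim` would be computed between a query and every candidate and used to rank them; a learned model would replace the fixed averaging with trained attention and fusion layers.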