Abstract: Image-text retrieval is a widely studied topic in the field of computer vision due to the exponential growth of multimedia data, whose core concept is to measure the similarity between images and text. However, most existing retrieval methods heavily rely on cross-attention mechanisms for cross-modal fine-grained alignment, which takes into account excessive irrelevant regions and treats prominent and non-significant words equally, thereby limiting retrieval accuracy. This paper aims to investigate an alignment approach that reduces the involvement of non-significant fragments in images and text while enhancing the alignment of prominent segments. For this purpose, we introduce the Cross-Modal Prominent Fragments Enhancement Aligning Network (CPFEAN), which achieves improved retrieval accuracy by diminishing the participation of irrelevant regions during alignment and relatively increasing the alignment similarity of prominent words. Additionally, we incorporate prior textual information into image regions to reduce misalignment occurrences. In practice, we first design a novel intra-modal fragments relationship reasoning method, and subsequently employ our proposed alignment mechanism to compute the similarity between images and text. Extensive quantitative comparative experiments on MS-COCO and Flickr30K datasets demonstrate that our approach outperforms state-of-the-art methods by about 5% to 10% in the rSum metric.
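
The abstract reports gains in rSum without defining it. In the image-text matching literature, rSum is conventionally the sum of Recall@1, Recall@5, and Recall@10 over both retrieval directions (image-to-text and text-to-image), so its maximum is 600. The following is a minimal sketch of that convention; the function names `recall_at_k` and `rsum` and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """Recall@K in percent.

    sim: (n_queries, n_candidates) similarity matrix.
    gt:  list of sets; gt[q] holds the indices of the correct candidates
         for query q (hypothetical input format for this sketch).
    """
    order = np.argsort(-sim, axis=1)  # best-scoring candidate first
    # Rank (0-based) of the best-placed correct candidate for each query.
    best_rank = np.array([
        np.where(np.isin(order[q], list(gt[q])))[0].min()
        for q in range(sim.shape[0])
    ])
    return {k: 100.0 * float(np.mean(best_rank < k)) for k in ks}

def rsum(sim_i2t, gt_i2t, sim_t2i, gt_t2i):
    """Sum of R@1, R@5, R@10 over both retrieval directions."""
    r_i2t = recall_at_k(sim_i2t, gt_i2t)
    r_t2i = recall_at_k(sim_t2i, gt_t2i)
    return sum(r_i2t.values()) + sum(r_t2i.values())

# Toy example: 3 images, 3 captions, perfectly matched one-to-one.
sim = np.eye(3)
gt = [{0}, {1}, {2}]
print(rsum(sim, gt, sim.T, gt))  # 600.0
```

Real benchmarks such as MS-COCO and Flickr30K pair each image with several captions, which is why `gt` holds a set of correct indices per query rather than a single index.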

What problem does this paper attempt to address?

This paper addresses a limitation of existing image-text retrieval methods: they rely heavily on cross-attention for cross-modal fine-grained alignment, which attends to too many unimportant regions and words and thus limits retrieval accuracy. Specifically, existing methods take too many irrelevant regions into account during alignment and treat salient and non-salient words equally, which degrades the final retrieval results.

To address this, the paper proposes a new alignment method, the **Cross-Modal Prominent Fragments Enhancement Aligning Network (CPFEAN)**. It improves retrieval accuracy by reducing the participation of unimportant fragments while enhancing the alignment of prominent ones. The specific measures are:

1. **Reducing the influence of unimportant regions**: Matched text information is fused into the image regions, with a gating mechanism controlling the fusion, so that salient regions can be distinguished from irrelevant ones.
2. **Highlighting prominent fragments**: During alignment, only the most prominent regions are considered and non-prominent regions are ignored. In addition, the most semantically relevant words in the text are selected for cross-modal information fusion, so that prominent words receive higher similarity during alignment.
3. **Introducing prior text information**: The text label of each region is extracted as prior information to improve the consistency of the semantic representation. Regions that are semantically consistent with the prior information receive a higher alignment similarity score, while misaligned regions receive a lower one.

With these improvements, the method outperforms existing state-of-the-art methods on multiple datasets. In particular, on the MS-COCO and Flickr30K datasets, the rSum metric improves by approximately 5% to 10%.
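
The first two measures described above (gated fusion of matched text into regions, then alignment restricted to the most prominent regions) can be illustrated with a rough NumPy sketch. This is not the paper's formulation: the gate form, the top-k selection, the softmax word weighting, and every name here (`prominent_alignment`, `W_gate`, `top_k`) are assumptions made for illustration, and the prior-label step is omitted.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prominent_alignment(regions, words, W_gate, top_k=1):
    """Illustrative image-text similarity score (assumed design, not CPFEAN).

    regions: (R, d) region features, words: (T, d) word features,
    W_gate:  (2d, d) hypothetical gate projection.
    """
    regions, words = l2norm(regions), l2norm(words)
    sim = regions @ words.T                          # (R, T) cosine similarities

    # 1. Fuse each region with its best-matching word through a gate.
    matched = words[sim.argmax(axis=1)]              # (R, d)
    gate = sigmoid(np.concatenate([regions, matched], axis=1) @ W_gate)
    fused = l2norm(gate * regions + (1.0 - gate) * matched)

    # 2. For each word, keep only its top_k most similar fused regions.
    sim_fused = fused @ words.T                      # (R, T)
    top = np.sort(sim_fused, axis=0)[-top_k:]        # (top_k, T)
    word_scores = top.mean(axis=0)                   # (T,)

    # 3. Weight words by a softmax over their scores (an assumption),
    #    so better-aligned words contribute more to the final score.
    weights = np.exp(word_scores - word_scores.max())
    weights /= weights.sum()
    return float(weights @ word_scores)

rng = np.random.default_rng(0)
score = prominent_alignment(
    rng.normal(size=(5, 8)),    # 5 regions, 8-dim features (toy data)
    rng.normal(size=(4, 8)),    # 4 words
    rng.normal(size=(16, 8)),   # random, untrained gate weights
)
```

With `top_k=1` each word is scored only by its single best region, which is the simplest way to keep irrelevant regions out of the similarity; a larger `top_k` admits more regions per word.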

A New Fine-grained Alignment Method for Image-text Matching

Cross-modal Prominent Fragments Enhancement Aligning Network for Image-text Retrieval

Position Focused Attention Network for Image-Text Matching

Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment

PFAN++: Bi-Directional Image-Text Retrieval with Position Focused Attention Network.

Multi-Scale Fine-Grained Alignments for Image and Sentence Matching

Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency for Image-Text Matching.

Re-ranking image-text matching by adaptive metric fusion

A Mutually Textual and Visual Refinement Network for Image-Text Matching

Align and Retrieve: Composition and Decomposition Learning in Image Retrieval with Text Feedback

Negative-Aware Attention Framework for Image-Text Matching

Similarity Reasoning and Filtration for Image-Text Matching

Cross-Modal Attention With Semantic Consistence for Image–Text Matching

Cross-modal alignment with graph reasoning for image-text retrieval

Reference-Aware Adaptive Network for Image-Text Matching

Improving Image-Text Matching with Bidirectional Consistency of Cross-Modal Alignment

HAAN: Learning a Hierarchical Adaptive Alignment Network for Image-Text Retrieval

Image-Text Retrieval with Cross-Modal Semantic Importance Consistency.

Bridging the gap: dual perception attention and local-global similarity fusion for cross-modal image-text matching

Memorize, Associate and Match: Embedding Enhancement via Fine-Grained Alignment for Image-Text Retrieval