Abstract: Image-text matching is vitally important in the field of multi-modal intelligence. A recently advocated approach decomposes images and texts into local fragments and then aligns regions with words, so that the image-text relevance score is obtained by aggregating the semantic similarities of matched region-word pairs. Despite its effectiveness, this strategy fails to express data relations exactly. On the text side, words decomposed from a concise sentence usually carry limited contextual information, which can produce text-region alignments that appear semantically identical but are actually false. On the image side, semantic ambiguity, in which multiple objects share the same semantic meaning, further exacerbates this problem. In this manuscript, we introduce a Mutually Textual and Visual Refinement Network (TVRN) to tackle the inaccurate cross-modal alignment problem. In a nutshell, TVRN improves inter-modal matching by enriching contextual information in sentences while reducing semantic ambiguity in images, so as to capture the most relevant relations. More specifically, we develop a new module that integrates visual contextual clues into the text modality to generate informative text features with richer geometric contexts. In turn, we design a semantic alignment enhancement module that leverages the consensus affinity of local image and text features to guide deeper semantic image embedding under the supervision of global image vectors. At the image-text matching stage, similarities at the local and global levels are integrated to capture both coarse-grained and fine-grained interactions between vision and language. Extensive experiments on the Flickr30K and MS-COCO benchmarks demonstrate that TVRN is superior to existing methods.
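
To make the scoring idea concrete, the following is a minimal sketch of how local region-word similarities might be aggregated and then combined with a global similarity. It is not the TVRN implementation: the abstract does not specify the aggregation rule, so the max-over-regions, mean-over-words pooling, the cosine similarity, the weight `alpha`, and all function names here are assumptions chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def local_score(regions, words):
    """Assumed aggregation: each word keeps its best-matching region,
    and the per-word maxima are averaged into one local score."""
    best_per_word = [max(cosine(w, r) for r in regions) for w in words]
    return sum(best_per_word) / len(best_per_word)


def matching_score(regions, words, global_image, global_text, alpha=0.5):
    """Assumed fusion: weighted sum of the local (fine-grained) score and
    the global (coarse-grained) cosine similarity; alpha is hypothetical."""
    fine = local_score(regions, words)
    coarse = cosine(global_image, global_text)
    return alpha * fine + (1.0 - alpha) * coarse


# Toy example: two image regions and two words in a 2-D embedding space.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
score = matching_score(regions, words, [1.0, 1.0], [1.0, 1.0])
```

In this toy case every word has a perfectly matching region and the global vectors coincide, so the fused score is close to 1. A model of the kind described above would compute the region, word, and global features with learned encoders rather than use hand-written vectors.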

A Multiview Text Imagination Network Based on Latent Alignment for Image-Text Matching

Giving Text More Imagination Space for Image-text Matching

MIGT: Multi-modal Image Inpainting Guided with Text

Image-Text Matching with Multi-View Attention

Advanced Multimodal Deep Learning Architecture for Image-Text Matching

Dual Semantic Relationship Attention Network for Image-Text Matching

A Mutually Textual and Visual Refinement Network for Image-Text Matching

Multi-level network based on transformer encoder for fine-grained image–text matching

Cross-modal Graph Matching Network for Image-text Retrieval

MiC: Image-text Matching in Circles with cross-modal generative knowledge enhancement

Multiview adaptive attention pooling for image-text retrieval

Improving Image-Text Matching by Integrating Word Sense Disambiguation

Graph Structured Network for Image-Text Matching

Multi-Scale Fine-Grained Alignments for Image and Sentence Matching

Image–Text Matching Model Based on CLIP Bimodal Encoding

Adaptive Latent Graph Representation Learning for Image-Text Matching

Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching

Learning Aligned Image-Text Representations Using Graph Attentive Relational Network

Multimodal Sentiment Analysis With Image-Text Interaction Network

HAAN: Learning a Hierarchical Adaptive Alignment Network for Image-Text Retrieval

Reference-Aware Adaptive Network for Image-Text Matching