Abstract: Image–text matching is the task of measuring the visual-semantic similarity between an image and a sentence. Recently, fine-grained matching methods that explore the local alignment between image regions and sentence words have shown advantages in inferring image–text correspondence by aggregating pairwise region-word similarities. However, local alignment is hard to achieve because some important image regions may be inaccurately detected or missing altogether. Meanwhile, some words with high-level semantics cannot strictly correspond to a single image region. To tackle these problems, we emphasize the importance of exploiting the global semantic consistency between image regions and sentence words as a complement to local alignment. In this article, we propose a novel hybrid matching approach named Cross-modal Attention with Semantic Consistency (CASC) for image–text matching. CASC is a joint framework that performs cross-modal attention for local alignment and multilabel prediction for global semantic consistency. It extracts semantic labels directly from the available sentence corpus without additional labeling cost, and these labels provide a global similarity constraint on the aggregated region-word similarity obtained by local alignment. Extensive experiments on the Flickr30k and Microsoft COCO (MSCOCO) data sets demonstrate the effectiveness of CASC in preserving global semantic consistency along with local alignment, and further show its superior image–text matching performance compared with more than 15 state-of-the-art methods.
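To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes region and word features already live in a shared embedding space, aggregates region-word cosine similarities through a word-to-region softmax attention, and uses a plain binary cross-entropy as a stand-in for the multilabel prediction term. The function names, the `temperature` value, and the mean aggregation are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def local_similarity(regions, words, temperature=9.0):
    """Aggregate region-word similarity via word-to-region attention.

    regions: (R, d) image region features (assumed shared embedding space)
    words:   (W, d) sentence word features
    temperature: hypothetical softmax sharpness, not taken from the paper
    """
    r = l2_normalize(regions)
    w = l2_normalize(words)
    sim = w @ r.T                                  # (W, R) cosine similarities
    logits = temperature * sim
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)  # each word attends over regions
    attended = l2_normalize(attn @ r)              # (W, d) attended region context
    per_word = np.sum(w * attended, axis=1)        # cosine(word, its context)
    return float(per_word.mean())                  # assumed aggregation: mean

def multilabel_bce(logits, targets, eps=1e-8):
    """Binary cross-entropy over semantic labels (stand-in for the global term)."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    loss = targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps)
    return float(-loss.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(36, 128))   # e.g. 36 detected regions
    words = rng.normal(size=(12, 128))     # e.g. a 12-word sentence
    print("local score:", local_similarity(regions, words))
    labels = (rng.random(50) > 0.9).astype(float)  # toy multi-hot label vector
    print("label loss:", multilabel_bce(rng.normal(size=50), labels))
```

In a joint framework of the kind the abstract describes, a score like `local_similarity` would feed a ranking objective while a term like `multilabel_bce` would be added as a global constraint; how the two are weighted is left open here because the abstract does not state it.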