Abstract: Extracting semantically consistent representations from multi-modal data helps computers understand the human world more comprehensively. Visual-semantic matching, one of the fundamental tasks in multi-modal learning, continues to attract attention. Recent research has steadily improved matching performance, but often while overlooking the balance between efficiency and effectiveness. In this paper, we address this dilemma with a newly proposed attention-based architecture. For effectiveness, we adopt the Transformer Encoder (TE) as our base model and introduce two modifications that tailor it to the visual-semantic matching task. First, we incorporate fine-grained supervision into the classic TE, allowing the model to capture sophisticated correspondences between modalities. Second, we employ a dynamic attention-evolving strategy that selectively passes useful information and strengthens the consistency of attention patterns between adjacent TE blocks. For efficiency, we propose a novel Select & Re-rank strategy that enables the model to ignore redundant information. This strategy significantly reduces the computational cost and increases the matching speed with only minor performance degradation. Under the supervision of both fine-grained and global similarity, the proposed architecture gradually captures and reorganizes useful inter-modality and intra-modality information, which leads to more comprehensive and discriminative embeddings. Experiments on two benchmark datasets show that the proposed method achieves competitive results in both efficiency and effectiveness.
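
The abstract does not specify how the Select & Re-rank strategy is implemented. As a rough illustration only, the sketch below shows one generic select-then-re-rank pattern for retrieval: a cheap global similarity shortlists a few candidates, and a costlier fine-grained score re-orders only that shortlist. Everything here is an assumption made for illustration, not the authors' method: the function names (`select_and_rerank`, `fine_grained_score`), the mean-pooled global embedding, the max-over-regions word alignment, and the parameter `k` are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps guards against zero norm)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def fine_grained_score(regions, words):
    """Hypothetical fine-grained similarity (an assumption, not the paper's score).

    For each word, take the cosine similarity to its best-matching region,
    then average over words.
    """
    sim = l2_normalize(words) @ l2_normalize(regions).T  # (n_words, n_regions)
    return float(sim.max(axis=1).mean())


def select_and_rerank(query_words, image_regions, k=2):
    """Return the indices of the top-k images, best match first.

    query_words:   array of shape (n_words, dim)
    image_regions: list of arrays, each of shape (n_regions_i, dim)
    """
    # Stage 1 (select): cheap global similarity on mean-pooled embeddings.
    q_global = l2_normalize(query_words.mean(axis=0))
    img_global = l2_normalize(np.stack([r.mean(axis=0) for r in image_regions]))
    coarse = img_global @ q_global  # one cosine score per image
    k = min(k, len(image_regions))
    candidates = np.argsort(-coarse)[:k]

    # Stage 2 (re-rank): costlier fine-grained score, computed only for the shortlist.
    fine = np.array(
        [fine_grained_score(image_regions[i], query_words) for i in candidates]
    )
    order = np.argsort(-fine)
    return [int(candidates[j]) for j in order]
```

In this sketch the fine-grained score is evaluated for only `k` images rather than the whole gallery, which is where the saving would come from; the price is that a true match dropped in the first stage cannot be recovered in the second.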

Dual Relation-Aware Synergistic Attention Network for Image-Text Matching

Learning Dual Semantic Relations with Graph Attention for Image-Text Matching

Select & Re-Rank: Effectively and Efficiently Matching Multimodal Data with Dynamically Evolving Attention

Dual Semantic Relationship Attention Network for Image-Text Matching

Bridging the gap: dual perception attention and local-global similarity fusion for cross-modal image-text matching

Dual-path Rare Content Enhancement Network for Image and Text Matching

Similarity Reasoning and Filtration for Image-Text Matching

Reference-Aware Adaptive Network for Image-Text Matching

Dual Attention Matching Network for Context-Aware Feature Sequence based Person Re-Identification

Visual-Semantic Matching by Exploring High-Order Attention and Distraction

Decoupled Cross-Modal Phrase-Attention Network for Image-Sentence Matching

Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching

Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching

Learning Aligned Image-Text Representations Using Graph Attentive Relational Network

Composing Object Relations and Attributes for Image-Text Matching

Dual-Level Representation Enhancement on Characteristic and Context for Image-Text Retrieval

Cross-Modal Attention With Semantic Consistence for Image–Text Matching

Advanced Multimodal Deep Learning Architecture for Image-Text Matching

Multi-Scale Fine-Grained Alignments for Image and Sentence Matching

Attend, Correct and Focus: A Bidirectional Correct Attention Network for Image-Text Matching