Abstract: Recent advances in image-text matching have been notable, yet prevailing models predominantly cater to broad queries and struggle to accommodate fine-grained query intent. In this paper, we work towards Entity-centric Image-Text Matching (EITM), a task in which the text and image involve specific entity-related information. The challenge of this task mainly lies in the larger semantic gap in entity association modeling, compared with the general image-text matching problem. To narrow the huge semantic gap between the entity-centric text and the images, we take the fundamental CLIP as the backbone and devise a multimodal attentive contrastive learning framework to tame CLIP to adapt to the EITM problem, developing a model named EntityCLIP. The key to our multimodal attentive contrastive learning is to generate interpretive explanation text using Large Language Models (LLMs) as bridging clues. Specifically, we proceed by extracting explanatory text from off-the-shelf LLMs. This explanation text, coupled with the image and text, is then input into our specially crafted Multimodal Attentive Experts (MMAE) module, which effectively integrates explanation texts to narrow the gap between the entity-related text and image in a shared semantic space. Building on the enriched features derived from MMAE, we further design an effective Gated Integrative Image-text Matching (GI-ITM) strategy. The GI-ITM employs an adaptive gating mechanism to aggregate MMAE's features, subsequently applying image-text matching constraints to steer the alignment between the text and the image. Extensive experiments are conducted on three social media news benchmarks, N24News, VisualNews, and GoodNews; the results show that our method surpasses competing methods by a clear margin.
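The abstract names two mechanisms without giving formulas: an adaptive gate that aggregates expert features, and a contrastive constraint that aligns image and text. The sketch below illustrates the general idea only, assuming a plain softmax gate over expert outputs and a CLIP-style symmetric contrastive (InfoNCE) loss. It is not EntityCLIP's actual MMAE or GI-ITM, and every function name and the temperature value are hypothetical choices made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(v):
    # Scale a vector to unit length (a zero vector is left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def gated_aggregate(expert_feats, gate_logits):
    # Assumed gating form: softmax weights over experts, then a weighted sum.
    weights = softmax(gate_logits)
    dim = len(expert_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, expert_feats))
            for d in range(dim)]

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    # Symmetric InfoNCE: pair k of images and texts is the positive pair,
    # all other pairs in the batch act as negatives.
    imgs = [l2_normalize(v) for v in img_feats]
    txts = [l2_normalize(v) for v in txt_feats]
    n = len(imgs)
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    i2t = -sum(math.log(softmax(sim[k])[k]) for k in range(n)) / n
    t2i = -sum(math.log(softmax([sim[r][k] for r in range(n)])[k])
               for k in range(n)) / n
    return (i2t + t2i) / 2

# Toy usage: two "experts" fused with equal gate logits, then a 2-pair batch.
fused = gated_aggregate([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each image is nearest to its own text and large when the pairs are swapped, which is the behaviour a matching constraint of this kind is meant to encourage.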

Turning a CLIP model into image-text matching

Image–Text Matching Model Based on CLIP Bimodal Encoding

EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning

Turning a CLIP Model into a Scene Text Detector

Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation

A Joint Encoding Model for Image-Text Matching Based on CLIP

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

Turning a CLIP Model into a Scene Text Spotter

Iterative Uni-modal and Cross-modal Clustered Contrastive Learning for Image-text Retrieval

iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

Modal Contrastive Learning based End-to-End Text Image Machine Translation

ITMix: Image-Text Mix Augmentation for Transferring CLIP to Image Classification

Masking-Based Cross-Modal Remote Sensing Image–Text Retrieval via Dynamic Contrastive Learning

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Multimodal Multilabel Classification by CLIP

CLIP-Driven Fine-grained Text-Image Person Re-identification

CLIP-enhanced multimodal machine translation: integrating visual and label features with transformer fusion