Abstract: Cross-modality human behavior analysis has attracted much attention from both academia and industry. In this article, we focus on the cross-modality image-text retrieval problem for human behavior analysis, which learns a common latent space for cross-modality data and thus benefits the understanding of human behavior from data in different modalities. Existing state-of-the-art cross-modality image-text retrieval models tend to be fine-grained region-word matching approaches: they first measure a local similarity for each image region or text word and then aggregate these local similarities to estimate the global image-text similarity. However, such fine-grained approaches often encounter a similarity bias problem, because they consider only the matched text words for an image region, or the matched image regions for a text word, when calculating similarity, and ignore unmatched words/regions that might still be salient enough to affect the global image-text similarity. In this article, we propose an Adaptive Confidence Matching Network (ACMNet), which is also a fine-grained matching approach, to deal with this similarity bias. Apart from calculating the local similarity of each region (or word) with its matched words (or regions), ACMNet introduces a confidence score for that local similarity by leveraging the global text (or image) information, which is expected to help measure the semantic relatedness of the region (or word) to the whole text (or image). ACMNet then incorporates the confidence scores together with the local similarities when estimating the global image-text similarity. To verify the effectiveness of ACMNet, we conduct extensive experiments and compare with state-of-the-art methods on two benchmark datasets, Flickr30k and MS COCO. Experimental results show that ACMNet outperforms the state-of-the-art methods by a clear margin, which demonstrates the effectiveness of ACMNet in human behavior analysis and the reasonableness of tackling the similarity bias issue.
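The abstract gives no formulas, so the following is only a minimal sketch of the general idea of weighting local region-word similarities by a confidence derived from global text information. It is not the ACMNet architecture. Every concrete choice here is an assumption made for illustration: cosine similarity, softmax attention with a hypothetical `temperature` parameter, mean-pooled word features as the "global text" vector, and a softmax over region-to-text relatedness as the confidence score. The function name `confidence_weighted_similarity` is likewise hypothetical.

```python
import numpy as np


def l2norm(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def confidence_weighted_similarity(regions, words, temperature=10.0):
    """Illustrative image-to-text similarity (assumed design, not ACMNet itself).

    regions: (R, D) array of image region features
    words:   (W, D) array of text word features
    Returns (global_similarity, local_similarities, confidences).
    """
    regions = l2norm(regions)
    words = l2norm(words)

    # Region-word cosine similarities, shape (R, W).
    sim = regions @ words.T

    # Each region attends to the words it matches best.
    attn = softmax(temperature * sim, axis=1)
    attended_text = l2norm(attn @ words)                 # (R, D)

    # Local similarity: cosine between a region and its attended text.
    local = np.sum(regions * attended_text, axis=1)      # (R,)

    # Assumed confidence: relatedness of each region to the whole text,
    # represented here by the mean-pooled word features.
    global_text = l2norm(words.mean(axis=0))             # (D,)
    conf = softmax(temperature * (regions @ global_text))  # (R,), sums to 1

    # Global similarity: confidence-weighted sum of local similarities.
    return float(np.sum(conf * local)), local, conf
```

Under these assumptions, a region that matches a few words strongly but is only weakly related to the sentence as a whole contributes less to the global score than it would under plain averaging of the local similarities. The text-to-image direction would mirror this with the roles of regions and words swapped.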

Giving Text More Imagination Space for Image-text Matching

ACMNet

Diversified text-to-image generation via deep mutual information estimation

Advanced Multimodal Deep Learning Architecture for Image-Text Matching

Image–Text Matching Model Based on CLIP Bimodal Encoding

Dual Semantic Relationship Attention Network for Image-Text Matching

Reference-Aware Adaptive Network for Image-Text Matching

Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching

Modality-Invariant Image-Text Embedding for Image-Sentence Matching

Information Theoretic Text-to-Image Alignment

Image-Text Matching with Multi-View Attention

A Mutually Textual and Visual Refinement Network for Image-Text Matching

Multimodal Sentiment Analysis With Image-Text Interaction Network

EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning

An End-to-End Image-Text Matching Approach Considering Semantic Uncertainty

Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching

HAAN: Learning a Hierarchical Adaptive Alignment Network for Image-Text Retrieval

Active Mining Sample Pair Semantics for Image-text Matching

Enhanced Semantic Similarity Learning Framework for Image-Text Matching

Multi-level network based on transformer encoder for fine-grained image–text matching

A New Fine-grained Alignment Method for Image-text Matching