Abstract: Cross-modality human behavior analysis has attracted much attention from both academia and industry. In this article, we focus on the cross-modality image-text retrieval problem for human behavior analysis, which learns a common latent space for cross-modality data and thus benefits the understanding of human behavior with data from different modalities. Existing state-of-the-art cross-modality image-text retrieval models tend to be fine-grained region-word matching approaches: they first measure a similarity for each image region or text word, and then aggregate these local similarities to estimate the global image-text similarity. However, such fine-grained approaches often encounter a similarity bias problem, because they compute similarity using only the matched text words for an image region, or the matched image regions for a text word, and ignore unmatched words/regions that might still be salient enough to affect the global image-text similarity. In this article, we propose an Adaptive Confidence Matching Network (ACMNet), which is also a fine-grained matching approach, to deal with this similarity bias. Apart from calculating the local similarity for each region (or word) with its matched words (or regions), ACMNet introduces a confidence score for each local similarity by leveraging the global text (or image) information, which is expected to help measure the semantic relatedness of the region (or word) to the whole text (or image). ACMNet then incorporates the confidence scores together with the local similarities when estimating the global image-text similarity. To verify the effectiveness of ACMNet, we conduct extensive experiments and compare against state-of-the-art methods on two benchmark datasets, Flickr30k and MS COCO. Experimental results show that ACMNet outperforms the state-of-the-art methods by a clear margin, which demonstrates its effectiveness in human behavior analysis and supports the case for tackling the similarity bias issue.
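
The matching scheme described in the abstract can be illustrated with a small NumPy sketch. The abstract does not give ACMNet's formulas, so everything concrete here is an assumption made for illustration: the function name `confidence_weighted_similarity`, the softmax attention with a `temperature` parameter, the use of the mean word feature as the "global text" vector, and the softmax that turns relatedness into confidence weights. Only the region-to-text direction is shown; the word-to-image direction would be symmetric.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_weighted_similarity(regions, words, temperature=9.0):
    """Hypothetical sketch, not the authors' implementation.

    regions: (R, D) image region features
    words:   (W, D) text word features
    Returns (global_similarity, local_similarities, confidences).
    """
    regions = l2norm(regions)
    words = l2norm(words)

    # Local similarity: each region attends to its matched words
    # (assumed softmax attention), then is compared with the attended text.
    attn = softmax(temperature * (regions @ words.T), axis=1)   # (R, W)
    attended = l2norm(attn @ words)                              # (R, D)
    local = np.sum(regions * attended, axis=1)                   # (R,)

    # Confidence: relatedness of each region to the whole text.
    # Assumption: the global text vector is the mean word feature.
    global_text = l2norm(words.mean(axis=0))                     # (D,)
    relatedness = regions @ global_text                          # (R,)
    conf = softmax(relatedness)                                  # sums to 1

    # Global similarity: confidence-weighted sum of local similarities.
    global_sim = float(np.sum(conf * local))
    return global_sim, local, conf

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim, local, conf = confidence_weighted_similarity(
        rng.normal(size=(5, 8)), rng.normal(size=(7, 8))
    )
    print("global similarity:", sim)
    print("confidences:", conf)
```

Because the confidences are normalized to sum to one in this sketch, the global score is a convex combination of the local similarities, so a region that is weakly related to the overall sentence contributes less even when its best-matching words score highly.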

Dual Semantic Relationship Attention Network for Image-Text Matching

Learning Dual Semantic Relations with Graph Attention for Image-Text Matching

Select & Re-Rank: Effectively and Efficiently Matching Multimodal Data with Dynamically Evolving Attention

Cross-Modal Attention With Semantic Consistence for Image–Text Matching

ACMNet

Multi-Modality Cross Attention Network for Image and Sentence Matching

Bridging the gap: dual perception attention and local-global similarity fusion for cross-modal image-text matching

Advanced Multimodal Deep Learning Architecture for Image-Text Matching

Semantic enhancement and multi-level alignment network for cross-modal retrieval

Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching

Decoupled Cross-Modal Phrase-Attention Network for Image-Sentence Matching

Visual-Semantic Matching by Exploring High-Order Attention and Distraction

Reference-Aware Adaptive Network for Image-Text Matching

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Learning Semantic Relationship among Instances for Image-Text Matching

Attend, Correct and Focus: A Bidirectional Correct Attention Network for Image-Text Matching

Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching

Image-Text Matching with Multi-View Attention

Composing Object Relations and Attributes for Image-Text Matching

Enhanced Semantic Similarity Learning Framework for Image-Text Matching

Giving Text More Imagination Space for Image-text Matching