Abstract: Cross-modality human behavior analysis has attracted considerable attention from both academia and industry. In this article, we focus on cross-modality image-text retrieval for human behavior analysis, which learns a common latent space for cross-modality data and thereby supports the understanding of human behavior from data in different modalities. Existing state-of-the-art image-text retrieval models are mostly fine-grained region-word matching approaches: they first measure a local similarity for each image region or text word and then aggregate these local similarities to estimate the global image-text similarity. However, such fine-grained approaches often suffer from a similarity bias problem. When computing the similarity, they consider only the matched text words for an image region, or the matched image regions for a text word, and ignore unmatched words and regions, which may still be salient enough to affect the global image-text similarity. To deal with this bias, we propose the Adaptive Confidence Matching Network (ACMNet), which is also a fine-grained matching approach. In addition to calculating the local similarity of each region (or word) with its matched words (or regions), ACMNet introduces a confidence score for that local similarity by leveraging the global text (or image) information; the score is intended to measure the semantic relatedness of the region (or word) to the whole text (or image). ACMNet then combines the confidence scores with the local similarities when estimating the global image-text similarity. To verify the effectiveness of ACMNet, we conduct extensive experiments on two benchmark datasets, Flickr30k and MS COCO, and compare against state-of-the-art methods. The experimental results show that ACMNet outperforms the state-of-the-art methods by a clear margin, which demonstrates its effectiveness in human behavior analysis and supports the value of addressing the similarity bias issue.
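
The matching idea described in the abstract can be illustrated with a minimal NumPy sketch for the image-to-text direction. This is not the authors' implementation: the function name `confidence_weighted_similarity`, the softmax attention with a `temperature` parameter, the mean-pooled global text vector, and the softmax normalization of the confidence scores are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def confidence_weighted_similarity(regions, words, temperature=9.0):
    """Hypothetical confidence-weighted region-word matching.

    regions: (R, D) array of image region features.
    words:   (W, D) array of text word features.
    Returns (global_score, local_sim, confidence).
    """
    regions = l2_normalize(regions)
    words = l2_normalize(words)

    # Local similarity: each region attends to its best-matching words
    # (assumed softmax attention over cosine similarities).
    cosine = regions @ words.T                        # (R, W)
    attention = softmax(temperature * cosine, axis=1)
    attended_text = l2_normalize(attention @ words)   # (R, D)
    local_sim = np.sum(regions * attended_text, axis=1)  # (R,)

    # Confidence: relatedness of each region to the whole text.
    # Assumption: the global text vector is the mean of the word features.
    global_text = l2_normalize(words.mean(axis=0))    # (D,)
    relatedness = regions @ global_text               # (R,)
    confidence = softmax(relatedness, axis=0)         # sums to 1 over regions

    # Global image-text similarity: confidence-weighted sum of local scores.
    global_score = float(np.sum(confidence * local_sim))
    return global_score, local_sim, confidence
```

Under these assumptions, a region that matches a few words well but is only weakly related to the sentence as a whole receives a low confidence and contributes less to the global score. The text-to-image direction would be symmetric, with words attending to regions and the confidence computed against a global image vector.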

Visible/Infrared Image Registration Based on Region-Adaptive Contextual Multifeatures

Cross-Modality Image Matching Network with Modality-Invariant Feature Representation for Airborne-Ground Thermal Infrared and Visible Datasets

Towards RGB-NIR Cross-modality Image Registration and Beyond

Selective Context Network with Neighbourhood Consensus for Aerial Image Registration

Multi-Modal Image Registration Based on Local Self-Similarity and Bidirectional Matching

Multimodal Remote Sensing Image Matching via Learning Features and Attention Mechanism

Object Matching of Visible-Infrared Image Based on Attention Mechanism and Feature Fusion

General cross-modality registration framework for visible and infrared UAV target image registration

Visible-infrared image patch matching based on attention mechanism

Visible-infrared Image Matching Based on Parameter-Free Attention Mechanism and Target-Aware Graph Attention Mechanism

A Multi-Level Cross-Attention Image Registration Method for Visible and Infrared Small Unmanned Aerial Vehicle Targets Via Image Style Transfer

Infrared and Visible Image Registration Based on Scale-Invariant PIIFD Feature and Locality Preserving Matching

Visible and Infrared Image Registration Based on Region Features and Edginess

Cross-Domain Co-Occurring Feature for Visible-Infrared Image Matching

Unsupervised Misaligned Infrared and Visible Image Fusion via Cross-Modality Image Generation and Registration

CMFA_Net: A cross-modal feature aggregation network for infrared-visible image fusion

A Semi-Supervised Image Registration Framework Based on Multimodal Cross-Attention

Interpretable Multi-Modal Image Registration Network Based on Disentangled Convolutional Sparse Coding

Registration of Infrared and Visible Image Based on OpenCV