Abstract: Cross-modality human behavior analysis has attracted considerable attention from both academia and industry. In this article, we focus on cross-modality image-text retrieval for human behavior analysis, which learns a common latent space for cross-modality data and thereby supports understanding human behavior from data in different modalities. Existing state-of-the-art image-text retrieval models are mostly fine-grained region-word matching approaches: they first measure a local similarity for each image region or text word and then aggregate these local similarities to estimate the global image-text similarity. However, such fine-grained approaches often suffer from a similarity bias problem. When computing similarity, they consider only the text words matched to an image region, or the image regions matched to a text word, and ignore unmatched words and regions, which may still be salient enough to affect the global image-text similarity. We propose an Adaptive Confidence Matching Network (ACMNet), itself a fine-grained matching approach, to deal with this similarity bias. In addition to calculating the local similarity of each region (or word) with its matched words (or regions), ACMNet introduces a confidence score for that local similarity by leveraging the global text (or image) information; the score is intended to measure the semantic relatedness of the region (or word) to the whole text (or image). ACMNet then combines the confidence scores with the local similarities when estimating the global image-text similarity. To verify its effectiveness, we conduct extensive experiments and compare against state-of-the-art methods on two benchmark datasets, Flickr30k and MS COCO. Experimental results show that ACMNet outperforms the state-of-the-art methods by a clear margin, which demonstrates its effectiveness in human behavior analysis and the soundness of tackling the similarity bias issue.
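The abstract gives no formulas, so the following is only a minimal sketch of the general idea of weighting local region-word similarities by a confidence derived from the global text. It is not ACMNet's actual formulation. The function names, the softmax attention with a `temperature` parameter, the mean-pooled global text vector, and the softmax-normalized confidence are all assumptions made for illustration; only the image-to-text direction is shown.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def confidence_weighted_similarity(regions, words, temperature=9.0):
    """Hypothetical confidence-weighted image-text similarity.

    regions: (R, D) array of image region features.
    words:   (W, D) array of text word features.
    Returns (score, local_sim, confidence).
    """
    regions = l2_normalize(regions)
    words = l2_normalize(words)

    # Cosine similarity between every region and every word: (R, W).
    sim = regions @ words.T

    # Each region attends to its best-matching words (assumed softmax attention).
    attn = softmax(temperature * sim, axis=1)
    attended_text = l2_normalize(attn @ words)  # (R, D)

    # Local similarity: each region versus its attended text context.
    local_sim = np.sum(regions * attended_text, axis=1)  # (R,)

    # Confidence: relatedness of each region to the whole text, here
    # approximated by a mean-pooled text vector (an assumption).
    global_text = l2_normalize(words.mean(axis=0))
    confidence = softmax(regions @ global_text)  # (R,), sums to 1

    # Global similarity: confidence-weighted sum of local similarities.
    score = float(np.sum(confidence * local_sim))
    return score, local_sim, confidence


# Toy usage with random features (5 regions, 7 words, 16 dimensions).
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 16))
words = rng.normal(size=(7, 16))
score, local_sim, confidence = confidence_weighted_similarity(regions, words)
print(score, confidence.round(3))
```

Because the confidences here sum to one, the global score is a weighted average of the local similarities. Regions that relate weakly to the text as a whole therefore contribute less than they would under plain averaging.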
