Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning

Yunbin Tu, Liang Li, Li Su, Chenggang Yan, Qingming Huang

2024-07-16

Abstract: Change captioning aims to succinctly describe the semantic change between a pair of similar images while remaining immune to distractors (illumination and viewpoint changes). Under these distractors, unchanged objects often exhibit pseudo changes in location and scale, and some objects may overlap others, which yields perturbed and less discriminative features for the two images. However, most existing methods capture the difference between the images directly, which risks producing error-prone difference features. In this paper, we propose a distractors-immune representation learning network that correlates the corresponding channels of the two image representations and decorrelates different ones in a self-supervised manner, thus attaining a pair of image representations that remain stable under distractors. The model can then better relate the two representations to capture reliable difference features for caption generation. To yield words based on the most related difference features, we further design a cross-modal contrastive regularization, which regularizes the cross-modal alignment by maximizing the contrastive alignment between the attended difference features and the generated words. Extensive experiments show that our method outperforms state-of-the-art methods on four public datasets. The code is available at https://github.com/tuyunbin/DIRL.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses the challenge of accurately describing semantic changes between two similar images in change captioning, especially in the presence of distractors such as lighting and viewpoint variations. It does so through the following:

1. **Representation Learning under Distractors**:
   - A method called "Distractors-Immune Representation Learning" (DIRL) is proposed. It correlates corresponding channels in the representations of the two images and decorrelates different channels in a self-supervised manner. The goal is a pair of image representations that stay stable and discriminative even when distractors are present.
2. **Cross-modal Contrastive Regularization**:
   - Cross-modal Contrastive Regularization (CCR) is designed to regularize cross-modal alignment by maximizing the contrastive alignment between the generated words and the attended difference features. This helps the decoder generate sentences based on the most relevant difference features.
3. **Experimental Validation**:
   - Extensive experiments on four public datasets show that the proposed method outperforms existing methods across the change scenarios evaluated.

Together, these components aim to make change captioning models more robust and accurate under such distractors.
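The two ideas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). It assumes that "correlate corresponding channels, decorrelate different ones" can be approximated by a cross-correlation objective in the style of Barlow Twins, and that the cross-modal contrastive regularization can be approximated by a symmetric InfoNCE loss over a batch. The function names, the `off_diag_weight` and `temperature` values, and the input shapes are all hypothetical choices made for illustration.

```python
import numpy as np


def channel_correlation_loss(feat_a, feat_b, off_diag_weight=0.005, eps=1e-8):
    """Assumed stand-in for the channel correlation objective.

    feat_a, feat_b: arrays of shape (N, C), i.e. N spatial positions (or
    samples) and C channels, one array per image of the pair.
    """
    # Standardize every channel so the matrix product below is a correlation.
    za = (feat_a - feat_a.mean(axis=0)) / (feat_a.std(axis=0) + eps)
    zb = (feat_b - feat_b.mean(axis=0)) / (feat_b.std(axis=0) + eps)
    corr = za.T @ zb / feat_a.shape[0]  # (C, C) cross-correlation matrix

    diag = np.diag(corr)
    # Pull matching channels toward correlation 1 ...
    on_diag = np.sum((1.0 - diag) ** 2)
    # ... and push different channels toward correlation 0.
    off_diag = np.sum(corr ** 2) - np.sum(diag ** 2)
    return on_diag + off_diag_weight * off_diag


def contrastive_alignment_loss(visual, text, temperature=0.1):
    """Assumed stand-in for the cross-modal regularizer: symmetric InfoNCE.

    visual, text: arrays of shape (B, D); row i of each forms a positive pair,
    all other rows in the batch act as negatives.
    """
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B) scaled cosine similarities

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row, then take the
        # log-probability assigned to the matching (diagonal) entry.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

Under these assumptions, the first loss is near zero when the two feature maps agree channel by channel and carry little redundancy across channels, and the second is small when each attended difference feature is closer to its own caption embedding than to the other captions in the batch.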

Semantic Relation-aware Difference Representation Learning for Change Captioning

Context-aware Difference Distilling for Multi-change Captioning

Viewpoint-Adaptive Representation Disentanglement Network for Change Captioning

Caption Feature Space Regularization for Audio Captioning

Neighborhood Contrastive Transformer for Change Captioning

Self-supervised Cross-view Representation Reconstruction for Change Captioning

Robust Change Captioning

Inter-Temporal Interaction and Symmetric Difference Learning for Remote Sensing Image Change Captioning

Image Difference Captioning With Instance-Level Fine-Grained Feature Representation

CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

Relation-aware Multi-pass Comparison Deconfounded Network for Change Captioning

Remote Sensing Image Change Captioning Using Multi-Attentive Network with Diffusion Model

Differential-Perceptive and Retrieval-Augmented MLLM for Change Captioning

A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning

Bidirectional difference locating and semantic consistency reasoning for change captioning

SMART: Syntax-Calibrated Multi-Aspect Relation Transformer for Change Captioning

Generating Diverse and Accurate Visual Captions by Comparative Adversarial Learning

See or Guess: Counterfactually Regularized Image Captioning

CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset