Abstract: Change captioning aims to describe the semantic change between a pair of images in natural language while remaining immune to viewpoint change. Building on the encoder-decoder architecture, most existing methods focus on encoding effective change representations to pass to the decoder. However, they suffer from an insufficient understanding of visual semantics, inadequate single-pass feature comparison, and a confounding bias caused by imbalanced viewpoint change data. These problems impair change representations and hinder unbiased caption generation. In this paper, we analyze and identify the confounding bias from a causality perspective and propose a Relation-aware Multi-pass Comparison Deconfounded (RMCD) network for change captioning, which improves the encoding of change representations and mitigates the bias. Specifically, in the encoding stage, to sufficiently understand visual semantics, we present a position-guided context aggregating module that captures the positional and contextual relations among objects in the image. Then, to obtain comprehensive change representations, we present a multi-pass feature comparison module that recognizes semantic differences at various feature levels and progressively integrates them. In the decoding stage, to generate de-biased captions, causal intervention is employed to remove the confounding bias, which introduces spurious correlations between encoded change representations and captions. State-of-the-art performance on four publicly available benchmark datasets, together with further visual analysis, demonstrates the superiority of our method.

Mitigating Dataset Bias in Image Captioning Through CLIP Confounder-Free Captioning Network

Deconfounded Image Captioning: A Causal Retrospect

A Survey on Causal Inference in Image Captioning

Mitigating Gender Bias in Captioning Systems

See or Guess: Counterfactually Regularized Image Captioning

Fine-grained Image Captioning with CLIP Reward

Relation-aware Multi-pass Comparison Deconfounded Network for Change Captioning

Improving Multimodal Datasets with Image Captioning

DecoupleCLIP: A Novel Cross-Modality Decouple Model for Painting Captioning

Towards Deconfounded Image-Text Matching with Causal Inference

Contextual Debiasing for Visual Recognition with Causal Mechanisms

Image Captioning Based on Adaptive Balancing Loss

Improving Image Captioning with Better Use of Caption

Language-guided Detection and Mitigation of Unknown Dataset Bias

CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning

MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

Exploring Discrete Diffusion Models for Image Captioning

Visually-Aware Context Modeling for News Image Captioning

Uncurated Image-Text Datasets: Shedding Light on Demographic Bias

ClipCap: CLIP Prefix for Image Captioning

Image Captioning with a Constraint of Image-to-Text Transformation