Relation-aware Multi-pass Comparison Deconfounded Network for Change Captioning
Zhicong Lu, Li Jin, Ziwei Chen, Changyuan Tian, Xian Sun, Xiaoyu Li, Yi Zhang, Qi Li, Guangluan Xu
DOI: https://doi.org/10.1109/tcsvt.2024.3445337
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Change captioning aims to describe the semantic change between a pair of images in natural language while remaining immune to viewpoint change. Building on the encoder-decoder architecture, most existing methods focus on encoding effective change representations to pass to the decoder. However, they suffer from an insufficient understanding of visual semantics, inadequate single-pass feature comparison, and a confounding bias caused by imbalanced viewpoint change data. These issues impair change representations and hinder unbiased caption generation. In this paper, we analyze and identify the confounding bias from a causality perspective and propose a Relation-aware Multi-pass Comparison Deconfounded (RMCD) network for change captioning, which improves the encoding of change representations and mitigates the bias. Specifically, in the encoding stage, to sufficiently understand visual semantics, a position-guided context aggregating module is presented to capture the positional and contextual relations among objects in the image. Then, to obtain comprehensive change representations, we present a multi-pass feature comparison module that recognizes semantic differences at various feature levels and progressively integrates them. In the decoding stage, to generate de-biased captions, causal intervention is employed to remove the confounding bias, which introduces spurious correlations between encoded change representations and captions. State-of-the-art performance on four publicly available benchmark datasets and further visual analysis demonstrate the superiority of our method.
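The abstract does not say which form of causal intervention the decoder uses. As a purely illustrative sketch, the toy example below assumes the standard backdoor adjustment, P(y | do(x)) = sum_z P(y | x, z) P(z), and compares it with the ordinary conditional P(y | x). All names and probability tables are hypothetical and not taken from the paper: z stands for a viewpoint-change level with an imbalanced prior, x for a discretized change representation, and y for a caption class.

```python
import numpy as np

# Hypothetical toy tables (NOT from the paper).
# z: viewpoint-change level, imbalanced prior (90% small, 10% large).
p_z = np.array([0.9, 0.1])
# P(x | z), shape (X, Z); each column sums to 1.
p_x_given_z = np.array([[0.8, 0.2],
                        [0.2, 0.8]])
# P(y | x, z), shape (X, Z, Y); each last-axis row sums to 1.
p_y_given_xz = np.array([[[0.9, 0.1], [0.6, 0.4]],
                         [[0.3, 0.7], [0.1, 0.9]]])


def backdoor_adjust(p_y_given_xz, p_z):
    """Interventional P(y | do(x)) = sum_z P(y | x, z) P(z)."""
    return np.einsum("xzy,z->xy", p_y_given_xz, p_z)


def observational(p_y_given_xz, p_x_given_z, p_z):
    """Ordinary conditional P(y | x) = sum_z P(y | x, z) P(z | x)."""
    p_xz = p_x_given_z * p_z[None, :]                      # joint P(x, z)
    p_z_given_x = p_xz / p_xz.sum(axis=1, keepdims=True)   # Bayes rule
    return np.einsum("xzy,xz->xy", p_y_given_xz, p_z_given_x)


p_do = backdoor_adjust(p_y_given_xz, p_z)
p_obs = observational(p_y_given_xz, p_x_given_z, p_z)
print("P(y | do(x)):\n", p_do)
print("P(y | x):\n", p_obs)
```

In this toy setting the two tables differ because z influences both x and y; the adjustment weights each stratum by the prior P(z) instead of by P(z | x), which is the sense in which an intervention removes the spurious path through the confounder.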