Self-supervised Cross-view Representation Reconstruction for Change Captioning

Yunbin Tu,Liang Li,Li Su,Zheng-Jun Zha,Chenggang Yan,Qingming Huang
DOI: https://doi.org/10.48550/arXiv.2309.16283
2023-09-28
Abstract:Change captioning aims to describe the difference between a pair of similar images. Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change. In this paper, we address this by proposing a self-supervised cross-view representation reconstruction (SCORER) network. Concretely, we first design a multi-head token-wise matching to model relationships between cross-view features from similar/dissimilar images. Then, by maximizing cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way. Based on these, we reconstruct the representations of unchanged objects by cross-attention, thus learning a stable difference representation for caption generation. Further, we devise a cross-modal backward reasoning to improve the quality of caption. This module reversely models a ``hallucination'' representation with the caption and ``before'' representation. By pushing it closer to the ``after'' representation, we enforce the caption to be informative about the difference in a self-supervised manner. Extensive experiments show our method achieves the state-of-the-art results on four datasets. The code is available at <a class="link-external link-https" href="https://github.com/tuyunbin/SCORER" rel="external noopener nofollow">this https URL</a>.
Computer Vision and Pattern Recognition,Computation and Language
What problem does this paper attempt to address?