Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset

Chenyang Liu, Rui Zhao, Hao Chen, Zhengxia Zou, Zhenwei Shi
DOI: https://doi.org/10.1109/tgrs.2022.3218921
IF: 8.2
2022-11-15
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Analyzing land cover changes with multitemporal remote sensing (RS) images is crucial for environmental protection and land planning. In this article, we explore RS image change captioning (RSICC), a new task that aims to generate human-like language descriptions of the land cover changes in multitemporal RS images. We propose a novel Transformer-based RSICC model (RSICCformer). It consists of three main components: 1) a CNN-based feature extractor to generate high-level features of RS image pairs; 2) a dual-branch Transformer encoder (DTE) to improve the feature discrimination capacity for the changes; and 3) a caption decoder to generate sentences describing the differences. The DTE consists of a hierarchy of processing stages to capture and recognize multiple changes of interest. Concretely, within the DTE we use the bitemporal feature differences as keys to enhance the image features (queries) from each temporal image. To explore the RSICC task, we build a large-scale dataset named LEVIR Change Captioning (LEVIR-CC), which contains 10,077 pairs of bitemporal RS images and 50,385 sentences describing the differences between the images. We benchmark existing state-of-the-art synthetic image change captioning methods on the LEVIR-CC dataset, and our RSICCformer outperforms previous methods by a significant margin (+4.98% on BLEU-4 and +9.86% on CIDEr-D). The attention visualization results also suggest that our model can focus on changes of interest and ignore irrelevant changes.
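The abstract's central mechanism, using bitemporal feature differences as keys to enhance each image's features (queries), can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the learned projection matrices, multiple heads, and hierarchical stages of the DTE are omitted, and two details the abstract does not specify are assumed here, namely that the values are also taken from the difference features and that the attended result is added back through a residual connection.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def difference_keyed_attention(feat, diff):
    """Enhance `feat` (N tokens x d) using `diff` (N tokens x d) as keys.

    Assumption: values are also taken from `diff`, and the result is
    added to the queries as a residual. No learned projections are used.
    """
    d = len(feat[0])
    scale = math.sqrt(d)
    enhanced = []
    for q in feat:
        # Scaled dot-product score between this query and every difference key.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in diff]
        weights = softmax(scores)
        # Weighted sum of the difference features (used as values here).
        attended = [sum(w * v[j] for w, v in zip(weights, diff)) for j in range(d)]
        # Residual connection back onto the original image feature.
        enhanced.append([x + a for x, a in zip(q, attended)])
    return enhanced


def dual_branch_step(feat_t1, feat_t2):
    """Apply the same difference-keyed attention to both temporal branches."""
    diff = [[b - a for a, b in zip(r1, r2)] for r1, r2 in zip(feat_t1, feat_t2)]
    return (difference_keyed_attention(feat_t1, diff),
            difference_keyed_attention(feat_t2, diff))


# Toy example: two tokens with two channels per temporal image.
before = [[1.0, 0.0], [0.0, 1.0]]
after = [[1.0, 0.0], [0.5, 2.0]]  # only the second token changed
out_before, out_after = dual_branch_step(before, after)
```

In this simplified form, an unchanged image pair yields all-zero difference features, so both branches return their inputs unmodified; the features are only altered where the two images differ.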
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics
What problem does this paper attempt to address?