TrTr-CMR: Cross-Modal Reasoning Dual Transformer for Remote Sensing Image Captioning

Yinan Wu, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Shuyuan Yang
DOI: https://doi.org/10.1109/tgrs.2024.3475633
IF: 8.2
2024-01-01
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Remote sensing image captioning is an interesting but challenging cross-modal reasoning task that spans computer vision and natural language processing. Most recent popular approaches to remote sensing image captioning use encoder-decoder architectures that rely on visual features captured by a CNN-based encoder and semantic information modeled by an RNN-based or LSTM-based decoder, but they struggle with multi-scale objects, multiple categories, and direction ambiguity. To make the most of the semantic understanding ability of Transformers, in this paper we propose a new attention-based visual-linguistic reasoning framework with a dual Transformer for remote sensing image captioning. Specifically, a Swin Transformer encoder with a shifted window partitioning scheme is introduced for multi-scale visual feature extraction to discover the intrinsic relationships among the objects, and a Transformer language model with self-attention and cross-attention is then designed as the decoder to generate a well-formed sentence for the image. Extensive experiments are conducted on the public remote sensing image captioning benchmark datasets, including UCM-Captions, Sydney-Captions, and RSICD. The strong performance on these benchmarks verifies the effectiveness and superiority of the proposed method. Additionally, the source code and models of this work are publicly available at https://github.com/LianYi233/TrTr-CMR.
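
To illustrate the cross-attention step that the abstract attributes to the decoder, the following is a minimal sketch of generic scaled dot-product attention in plain Python, in which decoder word states act as queries over encoder patch features. It is not taken from the TrTr-CMR repository: the function names, the toy vectors, and the omission of learned projections and multiple heads are all assumptions made for brevity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    Here queries stand in for decoder word states, while keys and values
    stand in for visual patch features produced by the encoder.
    """
    d = len(keys[0])  # feature dimension used for the 1/sqrt(d) scaling
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Hypothetical toy data: 3 image-patch features and 2 word states.
patch_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word_states = [[2.0, 0.0], [0.0, 2.0]]
attended, attn = cross_attention(word_states, patch_features, patch_features)
```

Each row of `attn` is a probability distribution over image patches for one word position, and `attended` holds the visual context vector that a decoder layer would combine with its self-attention output. A real decoder would apply learned query, key, and value projections before this step.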