Multi-label Semantic Feature Fusion for Remote Sensing Image Captioning

Shuang Wang, Xiutiao Ye, Yu Gu, Jihui Wang, Yun Meng, Jingxian Tian, Biao Hou, Licheng Jiao
DOI: https://doi.org/10.1016/j.isprsjprs.2021.11.020
IF: 12.7
2021-01-01
ISPRS Journal of Photogrammetry and Remote Sensing
Abstract: For remote sensing image (RSI) captioning tasks, two-stage RSI captioning methods have achieved high performance because they introduce the results of other RSI tasks, such as image classification, as prior information. However, most previous works treat image classification in the two-stage RSI captioning method as a single-label classification task, which cannot adequately describe the entire contents of complex RSIs and may cause semantic ambiguity. To address this problem and further refine image feature representation, we introduce multi-label classification into two-stage RSI captioning to provide sufficient and accurate prior semantic information, and we propose a multi-label semantic feature fusion (MLSFF) framework. Specifically, we design a robust multi-label semantic attribute extractor to extract multi-label semantic attributes of RSIs. To obtain discriminative feature representations for RSI captioning, we propose two cross-modal semantic feature fusion operators that fuse the extracted semantic attributes with the image features extracted by the convolutional neural network. Extensive numerical experiments show that the proposed method achieves state-of-the-art performance on the UCM-Captions, Sydney-Captions, and RSICD datasets. On the UCM-Captions dataset, our method achieves a gain of 8.2% in S_m score over the SAT (LAM) method (Zhang et al., 2019c). On the Sydney-Captions dataset, our method improves the S_m score by 17.4% compared with the TCE loss-based method (Li et al., 2020a). On the RSICD dataset, our method outperforms the multi-level attention method (Li et al., 2020b) by 3.2% in terms of the S_m score. Code is available at https://github.com/xtye5025/MLSFF.