Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion

An Zhao, Wenzhong Yang, Danny Chen, Fuyuan Wei
DOI: https://doi.org/10.3390/electronics13183605
IF: 2.9
2024-09-11
Electronics
Abstract: Remote-sensing image captioning (RSIC) aims to generate descriptive sentences for images by capturing both local and global semantic information. This task is challenging due to the diverse object types and varying scenes in images. To address these challenges, we propose a positional-channel semantic fusion transformer (PCSFTr). The PCSFTr model employs scene classification to initially extract visual features and learn semantic information. A novel positional-channel multi-headed self-attention (PCMSA) block captures spatial and channel dependencies simultaneously, enriching the semantic information. The feature fusion (FF) module further enhances the understanding of semantic relationships. Experimental results show that PCSFTr significantly outperforms existing methods. Specifically, the BLEU-4 score reached 78.42% on UCM-caption, 54.42% on RSICD, and 69.01% on NWPU-captions. This research provides new insights into RSIC by offering a more comprehensive understanding of semantic information and relationships within images and by improving the performance of image captioning models.
engineering, electrical & electronic; computer science, information systems; physics, applied