Remote Sensing Image Captioning with Sequential Attention and Flexible Word Correlation
Jie Wang, Binze Wang, Jiangbo Xi, Xue Bai, Okan K. Ersoy, Ming Cong, Siyan Gao, Zhe Zhao
DOI: https://doi.org/10.1109/lgrs.2024.3366984
IF: 5.343
2024-01-01
IEEE Geoscience and Remote Sensing Letters
Abstract: As a successful application of machine learning in remote sensing (RS) and natural language processing, captioning of remote sensing images has been actively developed. Remote sensing images are large in extent, complex in features, and rich in information. It is difficult both to extract visual features that adequately capture the underlying domain knowledge and to make full use of the extracted features for caption generation. To overcome this difficulty, we propose a novel model based on the encoder-decoder framework, termed remote sensing image captioning with sequential attention and flexible word correlation (SA-FWC). In the encoder, we fuse features from different layers of VGG16 to extract global and local information. In the decoder, we propose SA-FWC to make full use of the extracted visual information and generate accurate captions. Specifically, to exploit the visual features from the encoder, highlight important information, and reduce redundant information, a long short-term memory (LSTM) network in SA-FWC is used to obtain better feature representations. A feature fusion strategy and a self-attention mechanism are also employed to make full use of the visual features. Additionally, we provide a data augmentation strategy based on minimal training sample pairs. In the experiments, four evaluation metrics are used to assess the results, and the effects of various parameters on the results are discussed. The experimental results (BLEU 0.72, ROUGE 0.65, METEOR 0.37, and CIDEr 2.83) show that the proposed method is effective and outperforms other network structures.
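The two ideas named in the abstract, fusing features from different layers and applying self-attention to them, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fuse_layers` and `self_attention` are hypothetical, the fusion is assumed to be plain concatenation of per-region vectors, the attention uses identity query/key/value projections instead of learned ones, and the VGG16 backbone and LSTM decoder are omitted entirely.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(shallow, deep):
    """Concatenate per-region vectors from two layers.

    Assumption: both layers were already resized so that they describe
    the same number of image regions.
    """
    if len(shallow) != len(deep):
        raise ValueError("layers must describe the same number of regions")
    return [s + d for s, d in zip(shallow, deep)]


def self_attention(features):
    """Scaled dot-product self-attention with identity projections.

    Each output vector is a weighted average of all input vectors, with
    weights given by the softmax of scaled dot-product similarities.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    attended = []
    for query in features:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in features]
        weights = softmax(scores)
        attended.append([sum(w * value[j] for w, value in zip(weights, features))
                         for j in range(dim)])
    return attended


# Toy input: two regions, a 2-d "local" feature and a 1-d "global" feature each.
fused = fuse_layers([[1.0, 0.0], [0.0, 1.0]], [[0.5], [0.5]])
attended = self_attention(fused)
```

In this toy input each region attends most strongly to itself, so its own local component stays dominant after attention, while the shared global component is unchanged. A trained model would instead learn the projections and the fusion weights.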