Multi-View Feature Fusion and Visual Prompt for Remote Sensing Image Captioning

Shuang Wang, Qiaoling Lin, Xiutiao Ye, Yu Liao, Dou Quan, Zhongqian Jin, Biao Hou, Licheng Jiao
DOI: https://doi.org/10.1109/tgrs.2024.3426359
2024-01-01
Abstract: Remote sensing image (RSI) captioning is a vision-language multimodal task concentrating on both image comprehension and sentence generation. Several studies suggest that encoder-decoder-based methods have achieved success in RSI captioning. However, existing encoder-decoder-based methods may not fully explore image representations for RSI captioning and suffer from a lack of additional prompt information for sentence generation. In this article, a novel multi-view feature fusion and prompt (MVP)-based model is proposed to obtain better RSI representations and enhance language model performance in RSI captioning. Specifically, we design an attention-based feature fusion module to dynamically fuse multi-view visual features, which are extracted from the fine-tuned vision-language pretraining (VLP) model and the vision-task pretraining (VP) model. Then, a flexible visual prefix mapping module is proposed to transform images into visual prefixes, providing semantic information for the subsequent sentence generation. Finally, a BERT-based caption generator is applied to generate accurate descriptions based on the fused visual features and the visual prefixes, which are both outputs from our designed modules. Extensive experiments are conducted on three well-known benchmark datasets, demonstrating that our method achieves state-of-the-art (SOTA) performance.
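The abstract does not specify how the attention-based fusion module is built. As a rough illustration of what "dynamically fusing" two feature views can mean, the sketch below weights each view per token with a softmax over scaled dot-product scores and sums the result. It is a minimal sketch under stated assumptions, not the authors' module: the function names `softmax` and `fuse_views`, the single scoring vector `score_vec`, and the toy feature values are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(views, score_vec):
    """Fuse several feature views token by token (hypothetical helper).

    views: list of V feature maps, each a list of N tokens of D floats.
    score_vec: D floats standing in for a learned scoring vector (assumption).
    Returns (fused, weights): fused is N x D, weights is N x V.
    """
    dim = len(score_vec)
    fused, weights = [], []
    # zip(*views) groups the tokens that sit at the same position in every view.
    for tokens in zip(*views):
        # One scaled dot-product score per view for this token position.
        scores = [
            sum(t * s for t, s in zip(tok, score_vec)) / math.sqrt(dim)
            for tok in tokens
        ]
        view_weights = softmax(scores)
        # Weighted sum of the views, one feature dimension at a time.
        fused.append([
            sum(w * tok[d] for w, tok in zip(view_weights, tokens))
            for d in range(dim)
        ])
        weights.append(view_weights)
    return fused, weights


# Toy stand-ins for VLP and VP features: 2 tokens, 2 dimensions each.
vlp_feats = [[1.0, 0.0], [0.5, 0.5]]
vp_feats = [[0.0, 1.0], [0.5, 0.5]]
score_vec = [1.0, -1.0]

fused, weights = fuse_views([vlp_feats, vp_feats], score_vec)
```

Because the weights are computed per token, a position where one view scores higher leans toward that view, while a position where both views agree is left unchanged. A real module would learn the scoring parameters and would likely use multi-head attention rather than a single vector.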