DP-RSCAP: Dual Prompt-Based Scene and Entity Network for Remote Sensing Image Captioning

Lanxiao Wang, Heqian Qiu, Minjian Zhang, Fanman Meng, Qingbo Wu, Hongliang Li
DOI: https://doi.org/10.1109/igarss53475.2024.10641565
2024-01-01
Abstract: Remote sensing image captioning is a challenging task in remote sensing image analysis, and its core problem is how to accurately transform visual information into text. Existing methods usually rely on a simple multi-task learning strategy or a visual attention mechanism, which ignores the importance of intermediate connection information for cross-modal transformation. To solve this problem, we propose a novel dual prompt-based scene and entity network (DP-RSCap), which aims to fully utilize the cross-modal alignment ability of vision-language models to build textual prior information as an intermediate connection, narrowing the gap between modalities and improving caption quality. Specifically, we first introduce an entity-concept prompt exporter to obtain explicit entity concepts in images. Then, we design a scene class prompt generator that predicts the scene class and obtains fine-grained visual semantic features. Finally, we design a dual prompt-based caption decoder that aligns and merges the visual semantic features with the dual prompt information as explicit intermediate connections, which assists in generating precise captions. Extensive experiments on the challenging RSICD dataset demonstrate the superior ability of our model.
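The three stages described in the abstract (entity concepts, scene class, dual-prompt decoding) can be illustrated with a toy sketch. This is not the authors' implementation: the real modules are learned networks built on a vision-language model, whereas the sketch below assumes hand-written embedding vectors, cosine similarity as the alignment score, and a plain text template for the dual prompt. All function names, vectors, and scores are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_entities(image_vec, concept_vecs, k=2):
    """Hypothetical entity-concept step: rank concept names by their
    similarity to the image embedding and keep the k best."""
    ranked = sorted(
        concept_vecs,
        key=lambda name: cosine(image_vec, concept_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

def predict_scene(scene_scores):
    """Hypothetical scene-class step: pick the highest-scoring class."""
    return max(scene_scores, key=scene_scores.get)

def build_dual_prompt(entities, scene):
    """Combine both prompts into one text prefix for a caption decoder
    (assumed template, not taken from the paper)."""
    return f"Scene: {scene}. Entities: {', '.join(entities)}. Caption:"

# Toy inputs standing in for model outputs (assumed values).
image_vec = [1.0, 0.0, 0.2]
concept_vecs = {
    "airplane": [0.9, 0.1, 0.1],
    "runway": [0.8, 0.0, 0.6],
    "river": [0.0, 1.0, 0.0],
}
scene_scores = {"airport": 0.7, "farmland": 0.2, "river": 0.1}

entities = top_k_entities(image_vec, concept_vecs, k=2)
scene = predict_scene(scene_scores)
prompt = build_dual_prompt(entities, scene)
print(prompt)
```

In the paper's setting, the resulting prompts would be fused with visual semantic features inside the decoder rather than simply concatenated as text; the template here only shows how the two kinds of prior information might sit side by side.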