Semantic-Spatial Collaborative Perception Network for Remote Sensing Image Captioning

Qi Wang, Zhigang Yang, Weiping Ni, Junzhen Wu, Qiang Li
DOI: https://doi.org/10.1109/tgrs.2024.3502805
IF: 8.2
2024-01-01
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Image captioning is a fundamental vision-language task with wide-ranging applications in daily life. Existing methods often struggle to accurately interpret the semantic information in remote sensing images because of their complex backgrounds. Target region masks can effectively reflect the shape characteristics of targets and their potential interrelationships, so incorporating and fully integrating these features can significantly improve the quality of generated captions. However, researchers are hindered by the lack of datasets that contain the corresponding object masks. This raises a natural question: how can object masks be introduced and utilized efficiently? In this paper, we provide potential target masks for the publicly available remote sensing image captioning (RSIC) datasets, enabling models to utilize the regional features of targets for RSIC. We also propose a novel RSIC algorithm, abbreviated as S2CPNet, that combines regional positional features with fine-grained semantic information. To capture the semantic information from the image and the positional relationships from the mask, semantic and spatial feature enhancement sub-modules are introduced at the ends of the respective encoder branches. Furthermore, a cross-view feature fusion module is designed to integrate regional features and semantic information efficiently. Then, a target recognition decoder is developed to enhance the ability of the model to identify and describe critical targets in images. Finally, we improve the caption generation decoder by adaptively merging textual information with visual features to generate more accurate descriptions. Our model achieves satisfactory results on three RSIC datasets compared with existing methods. The related datasets and code will be open-sourced at https://github.com/CVer-Yang/SSCPNet.
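The abstract does not say how the cross-view feature fusion module works internally. As a rough illustration of what fusing an image (semantic) branch with a mask (spatial) branch can look like, the sketch below uses plain scaled dot-product cross-attention with a residual connection. This is a swapped-in, generic technique and an assumption, not the authors' method; the function name `cross_attention_fuse`, the token layout, and the absence of learned projections are all hypothetical choices made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(semantic, spatial):
    """Fuse two token sets with scaled dot-product cross-attention.

    semantic: list of n feature vectors (queries), e.g. from the image branch.
    spatial:  list of m feature vectors (keys and values), e.g. from the mask branch.
    Both use the same feature dimension d. This is an illustrative assumption,
    not the fusion module described in the paper.
    """
    dim = len(semantic[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in semantic:
        # Similarity of this semantic token to every spatial token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in spatial]
        weights = softmax(scores)
        # Attention-weighted average of the spatial tokens.
        context = [
            sum(w * value[j] for w, value in zip(weights, spatial))
            for j in range(dim)
        ]
        # Residual connection keeps the original semantic content.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


if __name__ == "__main__":
    image_tokens = [[1.0, 0.0], [0.0, 1.0]]
    mask_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in cross_attention_fuse(image_tokens, mask_tokens):
        print(row)
```

Each output token keeps the shape of its semantic input, so a fusion step of this kind could sit between the encoder branches and a decoder without changing downstream dimensions.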