HCNet: Hierarchical Feature Aggregation and Cross-Modal Feature Alignment for Remote Sensing Image Captioning

Zhigang Yang, Qiang Li, Yuan Yuan, Qi Wang
DOI: https://doi.org/10.1109/tgrs.2024.3401576
IF: 8.2
2024-05-28
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Remote sensing image captioning (RSIC) aims to describe the crucial objects in remote sensing images in natural language. Two factors chiefly limit caption quality: inefficient use of the object texture and semantic features in images, and ineffective cross-modal alignment between image and text features. To address these problems, this article presents HCNet, an RSIC network that combines hierarchical feature aggregation with cross-modal feature alignment. Specifically, a hierarchical feature aggregation module (HFAM) is proposed to obtain a comprehensive representation of visual features, which helps produce accurate descriptions. To account for the disparities between features of different modalities, a cross-modal feature interaction module (CFIM) is designed in the decoder to facilitate feature alignment; it exploits cross-modal features to localize critical objects. In addition, a cross-modal feature align loss is introduced to align image and text features. Extensive experiments show that HCNet achieves satisfactory performance; in particular, it improves the CIDEr score by 14.15% on the NWPU dataset compared with existing approaches. The source code is publicly available at https://github.com/CVer-Yang/HCNet.
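The abstract does not say what form the cross-modal feature align loss takes. As an illustration of the general idea only, one simple way to pull paired image and text features together is to penalize their cosine distance. The sketch below assumes each image and each caption has already been pooled into a single feature vector of the same length. The names `cosine_similarity` and `alignment_loss` are hypothetical, and this is not the authors' implementation (see the repository linked above for that).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def alignment_loss(image_feats, text_feats):
    """Mean (1 - cosine similarity) over paired image/text vectors.

    Assumption: image_feats[i] and text_feats[i] describe the same sample.
    The loss is near 0 when every pair points in the same direction and
    grows toward 2 as pairs point in opposite directions.
    """
    if len(image_feats) != len(text_feats):
        raise ValueError("image and text batches must have the same size")
    if not image_feats:
        raise ValueError("batches must not be empty")
    distances = [
        1.0 - cosine_similarity(img, txt)
        for img, txt in zip(image_feats, text_feats)
    ]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Toy 2-D features: the first pair is aligned, the second is orthogonal.
    images = [[1.0, 0.0], [0.0, 2.0]]
    texts = [[3.0, 0.0], [1.0, 0.0]]
    print(alignment_loss(images, texts))
```

In a training setup, a term like this would typically be added to the captioning loss with a weighting factor. Both the weighting and the pooling step are left out here because the abstract does not describe them.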
Imaging Science & Photographic Technology; Remote Sensing; Engineering, Electrical & Electronic; Geochemistry & Geophysics
What problem does this paper attempt to address?