Cooperative Connection Transformer for Remote Sensing Image Captioning

Kai Zhao, Wei Xiong
DOI: https://doi.org/10.1109/tgrs.2024.3360089
IF: 8.2
2024-02-10
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Feature extraction is fundamental to remote sensing image captioning (RSIC). Grid features and region features differ significantly in their representation. Grid features provide fine-grained information and retain background information for subsequent processing but lack salient target information. Region features provide object-level information but lack contextual information. To exploit the advantages of both for RSIC, we propose a cooperative connection Transformer (CCT). In the encoder, we design both grid and region features. We use the features obtained after global average pooling (GAP) as grid features to provide global background information, and we extract region features to provide salient region information. To address the information loss caused by region feature extraction, we map the region features and fuse them with the grid features using a fusion attention module that filters out the redundancy and noise introduced by fusion. In the decoder, we add cooperative connection attention, which connects the different feature types directly to each decoder layer so that the decoder can better perceive and use them. To address the lack of region feature annotations in the RSIC field, we provide region annotations for published datasets and extract region features accordingly. Extensive experiments demonstrate the effectiveness and superiority of the proposed method. Labeled region annotations are available at https://github.com/zk-1019/region-annotations.
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics
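
The fusion step described in the abstract can be illustrated with a small sketch. The abstract does not define the fusion attention module, so everything below is an assumption rather than the authors' method: region features act as queries over grid features using standard scaled dot-product attention, and a parameter-free sigmoid gate stands in for the filtering of redundancy and noise. The function names (`cross_attention`, `gated_fusion`), the gate, and the feature sizes are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    d = len(keys[0])
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values), weights


def gated_fusion(region, grid):
    """Fuse region features with grid context (illustrative assumption).

    Each region vector attends over the grid vectors, then a per-channel
    sigmoid gate mixes the region feature with its attended grid context.
    The gate has no learned parameters here; a real model would learn it.
    """
    context, weights = cross_attention(region, grid, grid)
    fused = []
    for r_vec, c_vec in zip(region, context):
        gate = [1.0 / (1.0 + math.exp(-(r + c))) for r, c in zip(r_vec, c_vec)]
        fused.append([g * r + (1.0 - g) * c for g, r, c in zip(gate, r_vec, c_vec)])
    return fused, weights


# Toy inputs: 9 grid cells and 3 regions, 8 channels each (sizes are arbitrary).
random.seed(0)
grid = [[random.gauss(0, 1) for _ in range(8)] for _ in range(9)]
region = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]
fused, weights = gated_fusion(region, grid)
```

In this sketch each fused vector is a per-channel convex combination of a region feature and its attended grid context, which is one simple way to let object-level features recover background information. The decoder-side cooperative connection attention is not sketched here because the abstract gives too little detail about it.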