A Multiscale Grouping Transformer With CLIP Latents for Remote Sensing Image Captioning

Lingwu Meng, Jing Wang, Ran Meng, Yang Yang, Liang Xiao
DOI: https://doi.org/10.1109/tgrs.2024.3385500
IF: 8.2
2024-04-24
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Recent progress has shown that integrating multiscale visual features with advanced Transformer architectures is a promising approach for remote sensing image captioning (RSIC). However, the lack of local modeling ability in self-attention may lead to inaccurate contextual information. Moreover, the scarcity of trainable image–caption pairs makes it difficult to exploit the semantic alignment between images and texts. To mitigate these issues, we propose a Multiscale Grouping Transformer with CLIP latents (MG-Transformer) for RSIC. First, a CLIP image embedding and a set of region features are extracted within a multilevel feature extraction (MFE) module. To achieve a comprehensive image representation, a semantic correlation (SC) module is designed to integrate the image embedding and region features with an attention gate. Subsequently, the integrated image features are fed into a Transformer model. The Transformer encoder utilizes dilated convolutions (DCs) with different dilation rates to obtain multiscale visual features. To enhance the local modeling ability of the self-attention mechanism in the encoder, we introduce a global grouping attention (GGA) mechanism. This mechanism incorporates a grouping operation into self-attention, allowing each attention head to focus on different contextual information. The Transformer decoder then adopts the meshed cross-attention mechanism to establish relationships between the various scales of visual features and the text features, from which the decoder generates the image captions. Experimental results on three RSIC datasets demonstrate the superiority of the proposed MG-Transformer. The code will be publicly available at https://github.com/One-paper-luck/MG-Transformer.
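The abstract states that the encoder applies dilated convolutions with different dilation rates to obtain multiscale features. The following is a minimal pure-Python sketch of that general idea on a 1D sequence, not the authors' implementation: the function names `dilated_conv1d` and `multiscale_features`, the zero padding, the odd kernel size, and the dilation rates (1, 2, 3) are all assumptions chosen for illustration.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-length 1D dilated convolution (cross-correlation form).

    Hypothetical helper for illustration; assumes an odd kernel size
    and zero padding, neither of which is specified in the abstract.
    """
    k = len(kernel)
    assert k % 2 == 1, "this sketch only supports odd kernel sizes"
    # Pad both ends with zeros so the output length matches the input.
    pad = dilation * (k - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    out = []
    for i in range(len(signal)):
        # Kernel taps are spaced `dilation` samples apart, which widens
        # the receptive field without adding parameters.
        out.append(sum(kernel[j] * padded[i + j * dilation] for j in range(k)))
    return out


def multiscale_features(signal, kernel, dilations=(1, 2, 3)):
    """Return one feature sequence per dilation rate (rates are assumed)."""
    return {d: dilated_conv1d(signal, kernel, d) for d in dilations}


if __name__ == "__main__":
    feats = multiscale_features([1, 2, 3, 4, 5], [1, 1, 1])
    for rate, values in feats.items():
        print(rate, values)
```

Each dilation rate sees the same input at a different spatial extent, so stacking or combining the resulting sequences gives a multiscale view. A real model would apply 2D convolutions with learned kernels to image feature maps.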
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics