Abstract: Recent progress has shown that integrating multiscale visual features with advanced Transformer architectures is a promising approach for remote sensing image captioning (RSIC). However, self-attention lacks local modeling ability, which can lead to inaccurate contextual information. Moreover, the scarcity of image–caption pairs available for training makes it difficult to exploit the semantic alignment between images and texts. To mitigate these issues, we propose a Multiscale Grouping Transformer with CLIP latents (MG-Transformer) for RSIC. First, a CLIP image embedding and a set of region features are extracted by a multilevel feature extraction (MFE) module. To obtain a comprehensive image representation, a semantic correlation (SC) module integrates the image embedding and the region features through an attention gate. The integrated image features are then fed into a Transformer model. The Transformer encoder applies dilated convolutions (DCs) with different dilation rates to obtain multiscale visual features. To enhance the local modeling ability of self-attention in the encoder, we introduce a global grouping attention (GGA) mechanism, which incorporates a grouping operation into self-attention so that each attention head can focus on different contextual information. The Transformer decoder then adopts a meshed cross-attention mechanism to relate the visual features at each scale to the text features, from which it generates the caption. Experimental results on three RSIC datasets demonstrate the superiority of the proposed MG-Transformer. The code will be publicly available at https://github.com/One-paper-luck/MG-Transformer.
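
As a rough illustration of how dilated convolutions with different dilation rates yield multiscale features, the following NumPy sketch applies one shared 3-tap kernel to a sequence of region features at several dilation rates. It is not the MG-Transformer implementation: the function names `dilated_conv1d` and `multiscale_features`, the 1D treatment of the region sequence, the depthwise (per-channel) kernel, and the rates (1, 2, 3) are all assumptions made for illustration.

```python
import numpy as np


def dilated_conv1d(x, kernel, dilation):
    """Depthwise 1D dilated convolution with zero 'same' padding.

    x:        (n, d) array, n region features of dimension d (assumed layout).
    kernel:   (k,) taps with k odd, shared across all d channels (assumption).
    dilation: spacing between kernel taps; the receptive field spans
              dilation * (k - 1) + 1 positions.
    """
    n, _ = x.shape
    k = kernel.shape[0]
    if k % 2 == 0:
        raise ValueError("kernel length must be odd for symmetric padding")
    pad = dilation * (k - 1) // 2
    # Zero-pad along the sequence axis only.
    padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros(x.shape, dtype=float)
    for j in range(k):
        start = j * dilation
        # Tap j reads the input at offset (j - (k - 1) / 2) * dilation.
        out += kernel[j] * padded[start:start + n]
    return out


def multiscale_features(x, kernel, dilations=(1, 2, 3)):
    """Return one feature map per dilation rate (hypothetical rates)."""
    return [dilated_conv1d(x, kernel, r) for r in dilations]


# Usage: six region features of dimension two, averaged over three taps.
regions = np.arange(12, dtype=float).reshape(6, 2)
scales = multiscale_features(regions, np.array([1.0, 1.0, 1.0]) / 3.0)
```

Each returned array has the same shape as the input, so the per-scale features could be passed to later attention layers side by side; larger dilation rates mix information from regions that are farther apart without adding parameters.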

Adaptively Clustering Neighbor Elements for Image-Text Generation

ClusterFormer: Clustering As A Universal Visual Learner

Adaptive Semantic-Enhanced Transformer for Image Captioning

Towards Better Text-to-Image Generation Alignment via Attention Modulation

Adaptively Aligned Image Captioning via Adaptive Attention Time

A Multiscale Grouping Transformer With CLIP Latents for Remote Sensing Image Captioning

Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding

CAT: Cross Attention in Vision Transformer

Adaptive Split-Fusion Transformer

Layer-wise enhanced transformer with multi-modal fusion for image caption

Task-Adaptive Attention for Image Captioning

Avtmnet: Adaptive Visual-Text Merging Network for Image Captioning

ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

Auxiliary feature extractor and dual attention-based image captioning

Based-CLIP early fusion transformer for image caption

Cluster-Former: Clustering-based Sparse Transformer for Question Answering

Dual visual align-cross attention-based image captioning transformer

An Efficient and Explanatory Image and Text Clustering System with Multimodal Autoencoder Architecture

Context-Aware Transformer for image captioning

Entangled Transformer for Image Captioning