Abstract: Remote Sensing Image Captioning (RSIC) is a crucial task in interpreting remote sensing images (RSIs), as it involves describing their content in clear and precise natural language. However, RSIC is difficult because of the intricate structure and distinctive features of these images, such as rotational ambiguity. Visually similar objects or areas can be misidentified. Additionally, prioritizing groups of objects with strong relational ties during captioning poses a significant challenge. To address these challenges, we propose the Visual Rotated Position Encoding Transformer for RSIC. First, rotation-invariant features and global features are extracted using a Multi-level Feature Extraction (MFE) module. To focus on closely related rotated objects, we design a visual rotated position encoding (VRoPE) module, which is incorporated into the Transformer encoder to model directional relationships between objects. To distinguish similar features and guide caption generation, we propose a Feature Enhancement Fusion (FEF) module consisting of a feature enhancement component and a feature fusion component. The feature enhancement component adopts a self-attention mechanism to construct fully connected graphs over object features. The feature fusion component integrates global features and word vectors to guide the caption generation process. In addition, we construct an RSI rotated object detection dataset, RSIC-ROD, and pre-train a rotated object detector on it. The proposed method demonstrates significant performance improvements on four datasets, showcasing enhanced capabilities in preserving descriptive details, distinguishing similar objects, and accurately capturing object relationships. The code will be publicly available at https://github.com/AnliLiu/VRoPE .
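The abstract does not say how VRoPE is computed. As a purely illustrative sketch, the code below assumes it behaves like a standard rotary position encoding in which each detected object's orientation angle plays the role of the position. The function names `rotate_pairs`, `attention_scores`, and `softmax` are hypothetical and are not taken from the authors' repository. The sketch shows one property such an encoding would have: self-attention scores between objects depend only on the difference between their angles, and because every object attends to every other object, the score matrix can be read as the edge weights of a fully connected graph.

```python
import math


def rotate_pairs(vec, angle):
    """Rotate each consecutive (even, odd) pair of vec by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def attention_scores(queries, keys, angles):
    """Scaled dot-product scores after rotating q_i and k_i by angles[i].

    Assumption: angles[i] is the orientation of object i, e.g. from a
    rotated object detector. Feature dimension must be even.
    """
    d = len(queries[0])
    rq = [rotate_pairs(q, a) for q, a in zip(queries, angles)]
    rk = [rotate_pairs(k, a) for k, a in zip(keys, angles)]
    # Entry [i][j] is the edge weight from object i to object j.
    return [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in rk]
        for q in rq
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


if __name__ == "__main__":
    q = [[1.0, 0.0, 0.5, 2.0], [0.3, -1.0, 1.5, 0.2]]
    k = [[0.2, 0.7, -1.0, 0.4], [1.1, 0.0, 0.3, -0.6]]
    scores = attention_scores(q, k, [0.1, 1.2])
    weights = [softmax(row) for row in scores]
    print(weights)
```

In this sketch, adding the same offset to every angle (rotating the whole image) leaves the scores unchanged, which is one way an encoding could capture relative direction between objects rather than absolute orientation. Whether the paper's module works this way is not stated in the abstract.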

Multi-View Feature Fusion and Visual Prompt for Remote Sensing Image Captioning

Multi-label Semantic Feature Fusion for Remote Sensing Image Captioning

Remote Sensing Image Captioning with Sequential Attention and Flexible Word Correlation

Visual Rotated Position Encoding Transformer for Remote Sensing Image Captioning

Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion

Remote Sensing Image Captioning Based on Multi-Level Feature Extraction and Adaptive Attention

A Patch-Level Region-Aware Module with a Multi-Label Framework for Remote Sensing Image Captioning

Attribute-Prompting Multi-Modal Object Reasoning Transformer for Remote Sensing Visual Grounding

Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning

TSFE: Two-Stage Feature Enhancement for Remote Sensing Image Captioning

Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning

Semantic Enhanced Video Captioning with Multi-feature Fusion

Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

Research on Video Captioning Based on Multifeature Fusion

Multi-Attention Fusion and Fine-Grained Alignment for Bidirectional Image-Sentence Retrieval in Remote Sensing

Video Captioning with External Knowledge Assistance and Multi-feature Fusion

Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning

A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning

Improving Image Captioning through Visual and Semantic Mutual Promotion

Learning Consensus-Aware Semantic Knowledge for Remote Sensing Image Captioning

DP-RSCAP: Dual Prompt-Based Scene and Entity Network for Remote Sensing Image Captioning