Abstract: Recently, remote sensing image captioning has gained significant attention in the remote sensing community. Due to the significant differences in spatial resolution of remote sensing images, existing methods in this field have predominantly concentrated on the fine-grained extraction of remote sensing image features, but they cannot effectively handle the semantic consistency between visual features and textual features. To align image and text features efficiently, we propose a novel two-stage vision-language pre-training-based approach to bootstrap interactive image-text alignment for remote sensing image captioning, called BITA, which relies on the design of a lightweight interactive Fourier Transformer to better align remote sensing image-text features. The Fourier layer in the interactive Fourier Transformer is capable of extracting multi-scale features of remote sensing images in the frequency domain, thereby reducing the redundancy of remote sensing visual features. Specifically, the first stage involves preliminary alignment through image-text contrastive learning, which aligns the learned multi-scale remote sensing features from the interactive Fourier Transformer with textual features. In the second stage, the interactive Fourier Transformer connects the frozen image encoder with a large language model. Then, prefix causal language modeling is utilized to guide the text generation process using visual features. Ultimately, across the UCM-caption, RSICD, and NWPU-caption datasets, the experimental results clearly demonstrate that BITA outperforms other advanced comparative approaches. The code is available at https://github.com/yangcong356/BITA.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve the image-text alignment problem in remote sensing image captioning tasks. Specifically, although existing methods have made significant progress in extracting fine-grained features of remote sensing images, they are insufficient in handling the semantic consistency between visual features and text features. The paper proposes a new two-stage vision-language pre-training method (called BITA), which better aligns remote sensing image and text features by designing a lightweight Interactive Fourier Transformer (IFT).

### Main contributions

1. **Introduction of the vision-language pre-training paradigm**:
   - Introduce the vision-language pre-training (VLP) paradigm into the remote sensing image captioning task and propose a new VLP model specifically designed for this task.
   - Through the first-stage image-text contrastive learning and the second-stage language modeling guidance, BITA can obtain robust visual features and achieve visual-semantic alignment of objects in remote sensing image-text pairs.
2. **Design of the Interactive Fourier Transformer (IFT) module**:
   - The IFT module serves as an intermediary between the frozen visual encoder and the frozen large language model (LLM), using a parameter-free Fourier transform to encode image and text information and reduce model parameters.
   - Efficiently learn multi-scale features of remote sensing images in the frequency domain through the Fourier transform.
3. **Two-stage pre-training process**:
   - **First stage**: Through image-text contrastive learning, constrain the IFT module to learn the most relevant and valuable visual representations for the text.
   - **Second stage**: Concatenate the visual features learned by IFT with the encoded text features and input them into the LLM, and use language modeling learning to guide visual-to-language generative learning.

### Method overview

1. **Vision-language pre-training setup**:
   - Use deep neural networks to extract image and text features from pre-training datasets to form image-text pairs.
   - Image and text embeddings are aligned through predefined pre-training tasks and finally input into the decoder to generate the target text.
2. **Discrete Fourier transform**:
   - Introduce the one-dimensional discrete Fourier transform (1D DFT) and its applications in signal processing.
   - The Fourier transform can decompose the input image into different frequency components, reflecting the multi-scale features of the input image.
3. **Model architecture**:
   - Build a lightweight and trainable IFT module using a parameter-free Fourier transform and a cross-attention mechanism.
   - The IFT module contains two sub-modules: a Fourier-transform-based image Transformer and a Fourier-transform-based text Transformer.
   - Through the cross-attention layer, the visual cue embeddings interact with the visual features extracted by the frozen image encoder to achieve an efficient low-dimensional visual feature representation.
4. **Representation learning stage**:
   - Maximize the mutual information between images and texts through image-text contrastive learning (ITC) to learn joint representations.
   - ITC optimizes the bidirectional objective function by comparing the similarities of positive and negative sample pairs.
5. **Visual-feature-guided language generation learning stage**:
   - Connect the pre-trained IFT and the frozen LLM and utilize the language generation and reasoning capabilities of the LLM.
   - Use prefix causal language modeling (PCLM) to control the interaction between visual cue embeddings and text embeddings and generate conditional text outputs.

### Experimental results

The paper conducted experiments on three datasets, namely UCM-caption, RSICD, and NWPU-caption, and the results show that BITA outperforms other advanced comparison methods on these datasets.
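The 1D DFT mentioned in the method overview, and the idea of a parameter-free Fourier layer, can be illustrated with a small standard-library sketch. It is not taken from the BITA code: the function names `dft_1d` and `fourier_mix` are hypothetical, and the mixing rule (DFT over the hidden dimension, then over the sequence dimension, keeping the real part) is an FNet-style assumption that may differ from the layer the paper actually uses.

```python
import cmath
import math


def dft_1d(signal):
    """Naive O(N^2) 1D DFT: X[k] = sum_n x[n] * exp(-2*pi*i*k*n / N)."""
    n_points = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / n_points)
            for n, x in enumerate(signal))
        for k in range(n_points)
    ]


def fourier_mix(tokens):
    """Parameter-free token mixing (assumed FNet-style, not confirmed for BITA).

    tokens: list of rows, shape (sequence_length, hidden_size).
    Returns a real-valued list of rows with the same shape.
    """
    # DFT along the hidden dimension of every token.
    rows = [dft_1d(row) for row in tokens]
    # DFT along the sequence dimension of every hidden channel.
    cols = [dft_1d(list(col)) for col in zip(*rows)]
    # Transpose back to (sequence_length, hidden_size) and keep the real part.
    return [[value.real for value in row] for row in zip(*cols)]


# A cosine with one cycle over 8 samples: its energy sits in bins 1 and 7.
cosine = [math.cos(2 * math.pi * n / 8) for n in range(8)]
magnitudes = [abs(value) for value in dft_1d(cosine)]
print([round(m, 6) for m in magnitudes])

# Mixing a toy (3 tokens x 2 channels) feature map; no weights are involved.
mixed = fourier_mix([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(mixed)
```

Because the transform has no learnable weights, a layer built this way adds no parameters; in practice a fast Fourier transform from a numerical library would replace the naive double loop.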
### Summary

By introducing the vision-language pre-training paradigm and designing the Interactive Fourier Transformer module, this paper addresses the image-text alignment problem in the remote sensing image captioning task and, according to its reported results, improves the semantic consistency and accuracy of the generated captions.
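The bidirectional image-text contrastive (ITC) objective described in the method overview can be sketched as a symmetric cross-entropy over a similarity matrix, where matching image-text pairs sit on the diagonal. This is a minimal illustration under assumptions, not the paper's implementation: the function names, the cosine similarity, and the temperature value of 0.07 are all hypothetical choices made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def itc_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss; temperature=0.07 is an assumed value."""
    images = [l2_normalize(v) for v in image_feats]
    texts = [l2_normalize(v) for v in text_feats]
    # sim[i][j]: scaled cosine similarity between image i and text j.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(sim) + cross_entropy_diagonal(sim_t))


# Toy batch of two pairs: aligned pairs should score a lower loss than swapped ones.
image_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = itc_loss(image_batch, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = itc_loss(image_batch, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, swapped_loss)
```

Each image treats the other texts in the batch as negatives and each text treats the other images as negatives, which is what makes the objective bidirectional.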

Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning
