Abstract: Image captioning and cross-modal retrieval are examples of tasks that involve the joint analysis of visual and linguistic information. In connection to remote sensing imagery, these tasks can help non-expert users in extracting relevant Earth observation information for a variety of applications. Still, despite some previous efforts, the development and application of vision and language models to the remote sensing domain have been hindered by the relatively small size of the available datasets and models used in previous studies. In this work, we propose RS-CapRet, a Vision and Language method for remote sensing tasks, in particular image captioning and text-image retrieval. We specifically propose to use a highly capable large decoder language model together with image encoders adapted to remote sensing imagery through contrastive language-image pre-training. To bridge together the image encoder and language decoder, we propose training simple linear layers with examples from combining different remote sensing image captioning datasets, keeping the other parameters frozen. RS-CapRet can then generate descriptions for remote sensing images and retrieve images from textual descriptions, achieving SOTA or competitive performance with existing methods. Qualitative results illustrate that RS-CapRet can effectively leverage the pre-trained large language model to describe remote sensing images, retrieve them based on different types of queries, and also show the ability to process interleaved sequences of images and text in a dialogue manner.

What problem does this paper attempt to address?

The paper addresses a limitation of existing Vision and Language (V&L) models in the remote sensing domain: they handle image captioning and cross-modal retrieval poorly because the available image-text datasets are small and the models used in previous studies are small as well. Specifically:

1. **Small-scale datasets**: Image-text pair datasets for remote sensing are relatively small, which limits the development and application of V&L models.
2. **Limited model capacity**: Existing remote sensing models do not exploit large pre-trained language models, which limits their performance.

To address these problems, the authors propose RS-CapRet, which combines a powerful Large Language Model (LLM) with a visual encoder adapted to remote sensing imagery through contrastive language-image pre-training. Linear projection layers connect the two, so RS-CapRet can perform remote sensing image captioning and text-based image retrieval without fine-tuning the LLM or the visual encoder, and it achieves state-of-the-art or competitive results.

### Specific problems and solutions

- **Problem**: How can large pre-trained language models improve remote sensing image captioning and cross-modal retrieval?
- **Solution**: Use a highly capable large decoder language model (such as LLamaV2-7B) together with an image encoder adapted to remote sensing imagery through contrastive language-image pre-training.
- **Problem**: How can visual information be combined effectively with the language model?
- **Solution**: Introduce simple linear layers that project image embeddings into the input embedding space of the language model. Only these layers are trained, on examples combined from different remote sensing image captioning datasets, while all other parameters stay frozen.
- **Problem**: How can one model support both image captioning and text-based image retrieval?
- **Solution**: Train a special [RET] token whose embedding representation can retrieve the corresponding image embeddings, which enables retrieval from a textual description. For captioning, RS-CapRet generates descriptions of remote sensing image content and can also process interleaved sequences of images and text in a conversational form.

### Summary

The paper aims to overcome the two obstacles that existing V&L models face in the remote sensing domain: small datasets and small models. By combining a powerful LLM with a visual encoder adapted to remote sensing imagery, and training only the linear layers that connect them, RS-CapRet achieves state-of-the-art or competitive performance in image captioning and cross-modal retrieval.
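
The two mechanisms described above, linear projection into the language model's input space and retrieval through a [RET] embedding, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The dimensions, the number of visual prefix tokens, the random stand-ins for encoder outputs and [RET] hidden states, and the InfoNCE-style form of the contrastive loss are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
import random


def init_linear(in_dim, out_dim, rng):
    """Create a small random linear layer (the only kind of trainable part here)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def to_visual_prefix(image_emb, weight, bias, num_tokens, lm_dim):
    """Map one image embedding to `num_tokens` vectors of size `lm_dim`,
    i.e. vectors shaped like language-model input embeddings."""
    flat = linear(image_emb, weight, bias)
    return [flat[i * lm_dim:(i + 1) * lm_dim] for i in range(num_tokens)]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_scores(query, candidates):
    """Cosine similarity between one query and each candidate."""
    q = normalize(query)
    return [sum(a * b for a, b in zip(q, normalize(c))) for c in candidates]


def info_nce(text_embs, image_embs, temperature=0.07):
    """Text-to-image contrastive loss (assumed form): text i matches image i."""
    total = 0.0
    for i, text in enumerate(text_embs):
        logits = [s / temperature for s in cosine_scores(text, image_embs)]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(text_embs)


# Hypothetical toy sizes; real encoders and LLMs are far larger.
IMG_DIM, LM_DIM, RET_DIM, NUM_TOKENS = 8, 6, 4, 2
rng = random.Random(0)

# Stand-ins for frozen image-encoder outputs (one vector per image).
image_embs = [[rng.gauss(0, 1) for _ in range(IMG_DIM)] for _ in range(3)]

# Captioning path: project an image embedding into the LM input space.
cap_w, cap_b = init_linear(IMG_DIM, NUM_TOKENS * LM_DIM, rng)
prefix = to_visual_prefix(image_embs[0], cap_w, cap_b, NUM_TOKENS, LM_DIM)

# Retrieval path: project [RET] hidden states and image embeddings
# into a shared space, then rank images by cosine similarity.
ret_hidden = [[rng.gauss(0, 1) for _ in range(LM_DIM)] for _ in range(3)]
txt_w, txt_b = init_linear(LM_DIM, RET_DIM, rng)
img_w, img_b = init_linear(IMG_DIM, RET_DIM, rng)
txt_proj = [linear(h, txt_w, txt_b) for h in ret_hidden]
img_proj = [linear(e, img_w, img_b) for e in image_embs]

scores = cosine_scores(txt_proj[0], img_proj)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
loss = info_nce(txt_proj, img_proj)
```

In this sketch `prefix` plays the role of the visual tokens that would be placed in front of the caption tokens, `ranking` orders the candidate images for the first query, and `loss` is the quantity that training would minimize by updating only the linear layers. With untrained random weights the ranking carries no meaning; the example only shows the data flow.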

Large Language Models for Captioning and Retrieving Remote Sensing Images
