Large Language Models for Captioning and Retrieving Remote Sensing Images

João Daniel Silva, João Magalhães, Devis Tuia, Bruno Martins
2024-02-09
Abstract: Image captioning and cross-modal retrieval are examples of tasks that involve the joint analysis of visual and linguistic information. In connection to remote sensing imagery, these tasks can help non-expert users in extracting relevant Earth observation information for a variety of applications. Still, despite some previous efforts, the development and application of vision and language models to the remote sensing domain have been hindered by the relatively small size of the available datasets and models used in previous studies. In this work, we propose RS-CapRet, a Vision and Language method for remote sensing tasks, in particular image captioning and text-image retrieval. We specifically propose to use a highly capable large decoder language model together with image encoders adapted to remote sensing imagery through contrastive language-image pre-training. To bridge together the image encoder and language decoder, we propose training simple linear layers with examples from combining different remote sensing image captioning datasets, keeping the other parameters frozen. RS-CapRet can then generate descriptions for remote sensing images and retrieve images from textual descriptions, achieving SOTA or competitive performance with existing methods. Qualitative results illustrate that RS-CapRet can effectively leverage the pre-trained large language model to describe remote sensing images, retrieve them based on different types of queries, and also show the ability to process interleaved sequences of images and text in a dialogue manner.
Computer Science
What problem does this paper attempt to address?
This paper addresses the difficulty of applying Vision and Language (V&L) models to image captioning and cross-modal retrieval for remote sensing imagery. Two factors have held the field back:

1. **Small-scale datasets**: The image-text pair datasets available for remote sensing are relatively small, which limits the development and application of V&L models.
2. **Small models**: Previous studies relied on relatively small models and did not take advantage of large pre-trained language models.

To address these problems, the authors propose RS-CapRet, which combines a large decoder language model (LLM) with an image encoder adapted to remote sensing imagery through contrastive language-image pre-training. Only the simple linear layers that bridge the two are trained; the LLM and the image encoder stay frozen. With this setup, RS-CapRet performs remote sensing image captioning and text-image retrieval, achieving state-of-the-art or competitive results.

### Specific problems and solutions

- **Problem**: How can large pre-trained language models improve remote sensing image captioning and cross-modal retrieval?
- **Solution**: Use a highly capable large decoder language model (such as LLamaV2-7B) together with an image encoder adapted to remote sensing imagery through contrastive language-image pre-training.
- **Problem**: How can visual information be combined effectively with the language model?
- **Solution**: Train simple linear layers that project image embeddings into the input embedding space of the language model, using examples combined from different remote sensing image captioning datasets, while keeping all other parameters frozen.
- **Problem**: How are image captioning and text-based image retrieval achieved?
- **Solution**: Train a special [RET] token whose embedding representation is used to retrieve the corresponding image embeddings, which enables retrieval from textual descriptions. For captioning, RS-CapRet generates descriptions of remote sensing image content and can also process interleaved sequences of images and text in a dialogue manner.

### Summary

The paper tackles two obstacles that V&L models face in the remote sensing domain: small datasets and small models. By connecting a frozen LLM to a frozen, remote-sensing-adapted image encoder through trained linear layers, RS-CapRet achieves state-of-the-art or competitive performance on image captioning and text-image retrieval.
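The bridging idea described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`linear`, `build_caption_input`, `rank_images`), the use of a single visual token, the cosine-similarity ranking, and the toy dimensions are all hypothetical, and the frozen encoder and LLM are replaced by hand-written vectors.

```python
import math
import random


def init_linear(in_dim, out_dim, rng):
    """Create a small random linear layer as (weight rows, bias)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def build_caption_input(image_embedding, token_embeddings, caption_proj):
    """Captioning path (assumed): project the frozen encoder's image
    embedding into the LM input space and prepend it to the text tokens."""
    visual_token = linear(image_embedding, *caption_proj)
    return [visual_token] + token_embeddings


def rank_images(ret_hidden, image_embeddings, text_proj, image_proj):
    """Retrieval path (assumed): project the [RET] hidden state and each
    image embedding into a shared space, then rank by cosine similarity."""
    query = linear(ret_hidden, *text_proj)
    scores = [cosine(query, linear(img, *image_proj))
              for img in image_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy captioning example: 8-dim image embedding, 6-dim LM input space.
rng = random.Random(0)
caption_proj = init_linear(8, 6, rng)          # the only trained part
image_embedding = [rng.random() for _ in range(8)]
caption_tokens = [[0.0] * 6 for _ in range(3)]  # stand-in token embeddings
lm_input = build_caption_input(image_embedding, caption_tokens, caption_proj)
print(len(lm_input), len(lm_input[0]))  # 4 6

# Toy retrieval example with identity projections so the result is easy
# to check by hand.
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
images = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
order, scores = rank_images([1.0, 0.0], images, identity, identity)
print(order)  # [1, 0, 2]
```

In a real system the projections would be learned while the encoder and LLM weights stay fixed, which is what keeps the number of trained parameters small.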