RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery

Yakoub Bazi, Laila Bashmal, Mohamad Mahmoud Al Rahhal, Riccardo Ricci, Farid Melgani
DOI: https://doi.org/10.3390/rs16091477
IF: 5
2024-04-24
Remote Sensing
Abstract: In this paper, we delve into the innovative application of large language models (LLMs) and their extension, large vision-language models (LVLMs), in the field of remote sensing (RS) image analysis. We particularly emphasize their multi-tasking potential with a focus on image captioning and visual question answering (VQA). In particular, we introduce an improved version of the Large Language and Vision Assistant Model (LLaVA), specifically adapted for RS imagery through a low-rank adaptation approach. To evaluate the model performance, we create the RS-instructions dataset, a comprehensive benchmark dataset that integrates four diverse single-task datasets related to captioning and VQA. The experimental results confirm the model's effectiveness, marking a step forward toward the development of efficient multi-task models for RS image analysis.
environmental sciences; imaging science & photographic technology; remote sensing; geosciences, multidisciplinary
What problem does this paper attempt to address?
The paper aims to develop a large vision-language model (LVLM) that handles both image captioning and visual question answering (VQA) for remote sensing imagery within a single model. The authors propose RS-LLaVA, an improved version of the Large Language and Vision Assistant (LLaVA) that is adapted to remote sensing data through Low-Rank Adaptation (LoRA). The work targets two gaps. First, current models perform poorly on remote sensing images, which differ fundamentally from natural images in their high resolution, diverse scales, and unique acquisition angles. Second, the field lacks a comprehensive instruction dataset for remote sensing, which is crucial for customizing LVLMs through instruction tuning. To evaluate the model, the authors created the RS-instructions dataset, a benchmark that integrates four single-task datasets related to captioning and VQA. The experimental results show that RS-LLaVA in multi-task mode performs better than state-of-the-art models in single-task mode, an important step toward efficient multi-task models for remote sensing image analysis.
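The low-rank adaptation idea mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `LoRALinear`, the helpers `matmul` and `transpose`, and the `rank`, `alpha`, and `seed` values are all hypothetical and chosen only for illustration. The sketch shows the general LoRA technique, in which a pretrained weight matrix W stays frozen and only a low-rank update B·A, scaled by alpha / rank, is trained.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


class LoRALinear:
    """Illustrative linear layer: frozen weight W plus a low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight, kept frozen
        # A starts with small random values and B with zeros, so the
        # adapted layer initially reproduces the pretrained layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        """Compute x @ W^T + scale * (x @ A^T) @ B^T for a batch of row vectors x."""
        base = matmul(x, transpose(self.weight))
        update = matmul(matmul(x, transpose(self.A)), transpose(self.B))
        return [[b + self.scale * u for b, u in zip(b_row, u_row)]
                for b_row, u_row in zip(base, update)]

    def trainable_params(self):
        """Count only the parameters of A and B; W is frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy usage: an 8 x 16 frozen weight with a rank-2 adapter.
frozen = [[0.0] * 16 for _ in range(8)]
layer = LoRALinear(frozen, rank=2)
full_params = len(frozen) * len(frozen[0])
print(layer.trainable_params(), "trainable vs", full_params, "in the full matrix")
```

In this toy case the adapter trains 48 values instead of the 128 in the full matrix; the saving grows with layer size, which is the reason low-rank adaptation is attractive for adapting a large model to a new domain such as remote sensing.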