RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery

Yakoub Bazi, Laila Bashmal, Mohamad Mahmoud Al Rahhal, Riccardo Ricci, Farid Melgani

DOI: https://doi.org/10.3390/rs16091477

IF: 5

2024-04-24

Remote Sensing

Abstract: In this paper, we delve into the innovative application of large language models (LLMs) and their extension, large vision-language models (LVLMs), in the field of remote sensing (RS) image analysis. We particularly emphasize their multi-tasking potential with a focus on image captioning and visual question answering (VQA). In particular, we introduce an improved version of the Large Language and Vision Assistant Model (LLaVA), specifically adapted for RS imagery through a low-rank adaptation approach. To evaluate the model performance, we create the RS-instructions dataset, a comprehensive benchmark dataset that integrates four diverse single-task datasets related to captioning and VQA. The experimental results confirm the model's effectiveness, marking a step forward toward the development of efficient multi-task models for RS image analysis.

environmental sciences; imaging science & photographic technology; remote sensing; geosciences, multidisciplinary

What problem does this paper attempt to address?

This paper aims to develop a large vision-language model (LVLM) that can perform both image captioning and visual question answering (VQA) in remote sensing image analysis. Specifically, the authors propose RS-LLaVA, an improved version of the existing Large Language and Vision Assistant (LLaVA) model that is adapted to remote sensing data through Low-Rank Adaptation (LoRA). The main objective is to overcome the poor performance of current models on remote sensing images, which differ fundamentally from natural images in their high resolution, diverse scales, and unique acquisition angles. The paper also addresses the lack of a comprehensive instruction dataset designed for the remote sensing field, which is crucial for customizing LVLMs effectively through instruction tuning. To evaluate the model, the researchers created the RS-instructions dataset, a benchmark that integrates four single-task datasets related to captioning and VQA. According to the experimental results, RS-LLaVA in multi-task mode outperforms state-of-the-art models operating in single-task mode, which marks an important step toward efficient multi-task models for remote sensing image analysis.
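As a rough illustration of the low-rank adaptation idea mentioned above, the sketch below shows the generic LoRA update: a frozen weight matrix W0 is combined with a trainable low-rank product, W = W0 + (alpha / r) * B A. This is a minimal sketch of the general technique, not the authors' implementation; the matrix sizes, the rank, the alpha value, and the helper names (`matmul`, `lora_effective_weight`, `count_params`) are all assumptions chosen for illustration, and the summary does not say which layers of LLaVA are adapted.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w0, a, b, alpha):
    """Return W0 + (alpha / r) * (B @ A), the generic LoRA-adapted weight.

    w0: frozen base weight, shape (d_out, d_in)
    a:  trainable matrix A, shape (r, d_in)
    b:  trainable matrix B, shape (d_out, r)
    """
    rank = len(a)  # r is the number of rows of A
    scale = alpha / rank
    delta = matmul(b, a)  # low-rank update, shape (d_out, d_in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


def count_params(matrix):
    """Count the entries of a matrix stored as a list of rows."""
    return sum(len(row) for row in matrix)


# Illustrative sizes only (assumed, not taken from the paper).
d_out, d_in, rank = 6, 8, 2
rng = random.Random(0)
w0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
# B starts at zero, so the adapted weight equals the base weight before training.
b = [[0.0] * rank for _ in range(d_out)]

w = lora_effective_weight(w0, a, b, alpha=4.0)
trainable = count_params(a) + count_params(b)
print(f"trainable LoRA parameters: {trainable} of {count_params(w0)} in the full matrix")
```

Only A and B would be updated during fine-tuning, which is why the approach is cheaper than updating the full weight matrix; the saving grows as the matrix dimensions increase relative to the rank.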

Large Vision-Language Models for Remote Sensing Visual Question Answering

LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation

LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description

RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts

Large Language Models for Captioning and Retrieving Remote Sensing Images

RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents

Vision-Language Models in Remote Sensing: Current progress and future trends

SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant

Aquila: A Hierarchically Aligned Visual-Language Model for Enhanced Remote Sensing Image Comprehension

VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment

RSGPT: A Remote Sensing Vision Language Model and Benchmark

LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval

Large Language Models Are Strong Audio-Visual Speech Recognition Learners

Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks

Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions

Enhancing Advanced Visual Reasoning Ability of Large Language Models

RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing

A Simple LLM Framework for Long-Range Video Question-Answering