Abstract:Remote Sensing Visual Question Answering (RSVQA) is a challenging task that involves interpreting complex satellite imagery to answer natural language questions. Traditional approaches often rely on separate visual feature extractors and language processing models, which can be computationally intensive and limited in their ability to handle open-ended questions. In this paper, we propose a novel method that leverages a generative Large Vision-Language Model (LVLM) to streamline the RSVQA process. Our approach consists of a two-step training strategy: domain-adaptive pretraining and prompt-based finetuning. This method enables the LVLM to generate natural language answers by conditioning on both visual and textual inputs, without the need for predefined answer categories. We evaluate our model on the RSVQAxBEN dataset, demonstrating superior performance compared to state-of-the-art baselines. Additionally, a human evaluation study shows that our method produces answers that are more accurate, relevant, and fluent. The results highlight the potential of generative LVLMs in advancing the field of remote sensing analysis.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to use large vision - language models (LVLMs) to improve the understanding ability of complex satellite images and generate natural - language answers in the remote - sensing visual question - answering (RSVQA) task. Traditional RSVQA methods usually rely on independent visual feature extractors and language processing models, which are not only computationally costly but also have limited ability in dealing with open - ended questions. This paper proposes a new method to optimize the performance of LVLMs in the RSVQA task through a two - step training strategy - domain - adaptation pre - training and prompt - based fine - tuning. Specifically, the main objectives of the paper include: 1. **Improve the performance of LVLMs in the RSVQA task**: By combining domain - adaptation pre - training and prompt - based fine - tuning, enable LVLMs to better understand and generate natural - language answers related to remote - sensing images. 2. **Reduce the dependence on predefined answer categories**: Traditional RSVQA methods often require predefined answer categories, while the method in this paper can directly generate natural - language answers without these limitations. 3. **Enhance the generalization ability of the model**: Through domain - adaptation pre - training and prompt - based fine - tuning, improve the performance of the model when dealing with new data and unseen remote - sensing images. The paper demonstrates through experiments on the RSVQA xBEN dataset that the proposed method outperforms the existing state - of - the - art methods in multiple metrics (such as accuracy, multiple - choice question accuracy, and F1 score for open - ended questions). In addition, human evaluation studies also show that the answers generated by this method are more accurate, relevant, and fluent. These results highlight the potential of generative LVLMs in advancing the field of remote - sensing analysis.

Large Vision-Language Models for Remote Sensing Visual Question Answering

RSVQA: Visual Question Answering for Remote Sensing Data

RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery

LiT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing

The curse of language biases in remote sensing VQA: the role of spatial attributes, language diversity, and the need for clear evaluation

Multilingual Augmentation for Robust Visual Question Answering in Remote Sensing Images

Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment

Segmentation-guided Attention for Visual Question Answering from Remote Sensing Images

Right this way: Can VLMs Guide Us to See More to Answer Questions?

Prompting Large Language Models with Fine-Grained Visual Relations from Scene Graph for Visual Question Answering

Visual Question Answering in Remote Sensing with Cross-Attention and Multimodal Information Bottleneck

Vision-Language Models in Remote Sensing: Current progress and future trends

Enhancing Remote Sensing Visual Question Answering: A Mask-Based Dual-Stream Feature Mutual Attention Network

Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models

LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation

PERS: Parameter-Efficient Multimodal Transfer Learning for Remote Sensing Visual Question Answering

Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery

BETTER GENERIC OBJECTS COUNTING WHEN ASKING QUESTIONS TO IMAGES: A MULTITASK APPROACH FOR REMOTE SENSING VISUAL QUESTION ANSWERING

See, Perceive, and Answer: A Unified Benchmark for High-Resolution Postdisaster Evaluation in Remote Sensing Images

How to find a good image-text embedding for remote sensing visual question answering?

RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents