Large Vision-Language Models for Remote Sensing Visual Question Answering

Surasakdi Siripong, Apirak Chaiyapan, Thanakorn Phonchai
2024-11-17
Abstract: Remote Sensing Visual Question Answering (RSVQA) is a challenging task that involves interpreting complex satellite imagery to answer natural language questions. Traditional approaches often rely on separate visual feature extractors and language processing models, which can be computationally intensive and limited in their ability to handle open-ended questions. In this paper, we propose a novel method that leverages a generative Large Vision-Language Model (LVLM) to streamline the RSVQA process. Our approach consists of a two-step training strategy: domain-adaptive pretraining and prompt-based finetuning. This method enables the LVLM to generate natural language answers by conditioning on both visual and textual inputs, without the need for predefined answer categories. We evaluate our model on the RSVQAxBEN dataset, demonstrating superior performance compared to state-of-the-art baselines. Additionally, a human evaluation study shows that our method produces answers that are more accurate, relevant, and fluent. The results highlight the potential of generative LVLMs in advancing the field of remote sensing analysis.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper addresses how to use large vision-language models (LVLMs) to interpret complex satellite images and generate natural-language answers in the remote sensing visual question answering (RSVQA) task. Traditional RSVQA methods usually rely on separate visual feature extractors and language processing models, which are computationally costly and handle open-ended questions poorly. The paper proposes a two-step training strategy for LVLMs on RSVQA: domain-adaptive pretraining followed by prompt-based finetuning. Its main objectives are:

1. **Improve the performance of LVLMs in the RSVQA task**: Combining domain-adaptive pretraining with prompt-based finetuning enables the LVLM to better understand remote sensing images and generate natural-language answers about them.
2. **Reduce the dependence on predefined answer categories**: Traditional RSVQA methods often require a fixed set of answer categories, whereas the proposed method generates natural-language answers directly.
3. **Enhance the generalization ability of the model**: The same two-step training is intended to improve performance on new data and unseen remote sensing images.

Experiments on the RSVQAxBEN dataset show that the proposed method outperforms existing state-of-the-art methods on several metrics, including overall accuracy, accuracy on multiple-choice questions, and F1 score on open-ended questions. A human evaluation study also finds that the generated answers are more accurate, relevant, and fluent. These results highlight the potential of generative LVLMs in advancing the field of remote sensing analysis.
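
To make the prompt-based setup and the answer-level metrics mentioned above more concrete, here is a minimal Python sketch. It is illustrative only: the prompt template, the text normalization, and the token-overlap definition of F1 are assumptions, since the summary does not specify the paper's actual prompt format or how its F1 score is computed. The function names `build_prompt`, `normalize`, `exact_match`, and `token_f1` are hypothetical.

```python
from collections import Counter


def build_prompt(question: str) -> str:
    """Wrap a question in an instruction template (hypothetical format)."""
    return (
        "You are analyzing a satellite image.\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def exact_match(prediction: str, reference: str) -> bool:
    """Accuracy-style check: answers match after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference.

    Assumed definition, not necessarily the one used in the paper.
    """
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty is a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because a generative model is not restricted to predefined answer categories, its output may be phrased differently from the reference answer. An overlap-based score such as `token_f1` gives partial credit in that case, while `exact_match` suits closed questions such as yes/no.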