Abstract: The recent development of vision language models (VLMs) has led to significant advances in visual-language integration through visual instruction tuning, and they have rapidly evolved in the field of remote sensing image understanding, demonstrating their powerful capabilities. However, existing remote sensing vision language models (RSVLMs) mainly focus on image-level or frame-level understanding, making it difficult to achieve fine-grained pixel-level visual-language alignment. Additionally, the lack of mask-based instructional data limits their further development. In this paper, we propose a mask-text instruction tuning method called Aquila-plus, which extends the capabilities of RSVLMs to achieve pixel-level visual understanding by incorporating fine-grained mask regions into language instructions. To achieve this, we first meticulously constructed a mask region-text dataset containing 100K samples, and then designed a visual-language model by injecting pixel-level representations into a large language model (LLM). Specifically, Aquila-plus uses a convolutional CLIP as the visual encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution inputs. Experimental results demonstrate that Aquila-plus outperforms existing methods in various region understanding tasks, showcasing its novel capabilities in pixel-level instruction tuning.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is the deficiency of existing remote sensing vision-language models (RSVLMs) in pixel-level vision-language alignment. Specifically:

1. **Limitations of Existing Models**:
   - Existing RSVLMs mainly focus on image-level or frame-level understanding and struggle to achieve fine-grained pixel-level vision-language alignment.
   - The lack of mask-based instruction data limits the further development of these models.
2. **Research Objectives**:
   - Propose a new method, Aquila-plus, which extends the capabilities of RSVLMs to achieve pixel-level visual understanding by introducing fine-grained mask regions into language instructions.
   - Construct a large-scale mask-text dataset, Aquila-plus-100K, containing 100,000 samples to support pixel-level instruction tuning.
3. **Technological Innovations**:
   - Use a convolutional CLIP as the visual encoder. Compared with ViT-based models, the convolutional CLIP performs better on high-resolution inputs and is more efficient and robust.
   - Design a mask-aware visual extractor that can extract accurate visual mask features from high-resolution inputs.
   - Interleave visual features with language instructions to form the input sequence, which enhances the ability of large language models (LLMs) to understand fine-grained visual information.
4. **Experimental Verification**:
   - The experimental results show that Aquila-plus outperforms existing methods in various region understanding tasks, demonstrating its novel ability in pixel-level instruction tuning.

In summary, this paper aims to improve the fine-grained and open-world visual understanding capabilities of vision-language models in remote sensing image understanding by introducing pixel-level mask regions and constructing a large-scale dataset.
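The two mechanisms named in the summary, extracting a feature for a mask region and interleaving it with the instruction, can be illustrated with a small sketch. The summary does not say how the mask-aware visual extractor pools features or how regions are marked in the prompt, so the mean pooling, the `<region>` placeholder, and the function names `mask_pool` and `interleave` are all assumptions made for illustration, not the paper's implementation. Plain Python lists stand in for tensors to keep the example dependency-free.

```python
from typing import List, Sequence, Union

Feature = List[float]


def mask_pool(feature_map: Sequence[Sequence[Feature]],
              mask: Sequence[Sequence[int]]) -> Feature:
    """Average the feature vectors of all pixels where the mask is non-zero.

    feature_map is H x W x C (nested lists), mask is H x W with 0/1 entries.
    Mean pooling is an assumed choice, not confirmed by the source.
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(feature_map, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total  # empty mask: fall back to a zero vector
    return [t / count for t in total]


def interleave(tokens: Sequence[str],
               region_features: Sequence[Feature],
               placeholder: str = "<region>") -> List[Union[str, Feature]]:
    """Replace each placeholder token with the next region feature, in order.

    The "<region>" placeholder is hypothetical.
    """
    features = list(region_features)
    sequence: List[Union[str, Feature]] = []
    used = 0
    for tok in tokens:
        if tok == placeholder:
            if used >= len(features):
                raise ValueError("more placeholders than region features")
            sequence.append(features[used])
            used += 1
        else:
            sequence.append(tok)
    return sequence


# Toy 2 x 2 feature map with 2 channels and a mask selecting the diagonal.
feature_map = [[[1.0, 2.0], [3.0, 4.0]],
               [[5.0, 6.0], [7.0, 8.0]]]
mask = [[1, 0],
        [0, 1]]

pooled = mask_pool(feature_map, mask)  # mean of [1, 2] and [7, 8]
sequence = interleave(["Describe", "<region>", "in", "detail"], [pooled])
```

In a real model the text tokens would be embedded and the pooled region feature projected into the same space before being passed to the LLM; the sketch only shows the ordering of the mixed sequence.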

Aquila-plus: Prompt-Driven Visual-Language Models for Pixel-Level Remote Sensing Image Understanding

Aquila: A Hierarchically Aligned Visual-Language Model for Enhanced Remote Sensing Image Comprehension

Osprey: Pixel Understanding with Visual Instruction Tuning

LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation

Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment

Large Vision-Language Models for Remote Sensing Visual Question Answering

Multilingual Augmentation for Robust Visual Question Answering in Remote Sensing Images

Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques

RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery

Visually-Augmented Language Modeling

LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description

RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

From Images to Textual Prompts: Zero-Shot Visual Question Answering with Frozen Large Language Models

DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark