Aquila-plus: Prompt-Driven Visual-Language Models for Pixel-Level Remote Sensing Image Understanding

Kaixuan Lu
2024-11-09
Abstract: The recent development of vision language models (VLMs) has led to significant advances in visual-language integration through visual instruction tuning, and they have rapidly evolved in the field of remote sensing image understanding, demonstrating their powerful capabilities. However, existing RSVLMs mainly focus on image-level or frame-level understanding, making it difficult to achieve fine-grained pixel-level visual-language alignment. Additionally, the lack of mask-based instructional data limits their further development. In this paper, we propose a mask-text instruction tuning method called Aquila-plus, which extends the capabilities of RSVLMs to achieve pixel-level visual understanding by incorporating fine-grained mask regions into language instructions. To achieve this, we first meticulously constructed a mask region-text dataset containing 100K samples, and then designed a visual-language model by injecting pixel-level representations into a large language model (LLM). Specifically, Aquila-plus uses a convolutional CLIP as the visual encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution inputs. Experimental results demonstrate that Aquila-plus outperforms existing methods in various region understanding tasks, showcasing its novel capabilities in pixel-level instruction tuning.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper attempts to solve is the deficiency of existing remote-sensing vision-language models (RSVLMs) in pixel-level vision-language alignment. Specifically:

1. **Limitations of Existing Models**:
   - Existing RSVLMs mainly focus on image-level or frame-level understanding, which makes fine-grained pixel-level vision-language alignment difficult to achieve.
   - The lack of mask-based instruction data limits the further development of these models.
2. **Research Objectives**:
   - Propose a new method, Aquila-plus, which extends the capabilities of RSVLMs to achieve pixel-level visual understanding by introducing fine-grained mask regions into language instructions.
   - Construct a large-scale mask-text dataset, Aquila-plus-100K, containing 100,000 samples to support pixel-level instruction tuning.
3. **Technological Innovations**:
   - Use a convolutional CLIP as the visual encoder. Compared with ViT-based models, the convolutional CLIP performs better on high-resolution inputs and is more efficient and robust.
   - Design a mask-aware visual extractor that extracts accurate visual mask features from high-resolution inputs.
   - Interleave the visual features with language instructions to form the input sequence, which enhances the ability of the large language model (LLM) to understand fine-grained visual information.
4. **Experimental Verification**:
   - The experimental results show that Aquila-plus outperforms existing methods in various region understanding tasks, demonstrating its novel ability in pixel-level instruction tuning.

In summary, this paper aims to improve the fine-grained and open-world visual understanding capabilities of vision-language models in remote-sensing image understanding by introducing pixel-level mask regions and constructing a large-scale dataset.
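The summary above names a mask-aware visual extractor and the interleaving of visual features with language instructions, but it does not describe their internals. As a rough illustration only, the sketch below uses masked average pooling, a common way to turn a binary region mask and a feature map into a single region vector, and then substitutes that vector for a placeholder token in the instruction. The function names, the `<region>` placeholder, and the choice of average pooling are all assumptions made for this example, not details taken from the paper.

```python
from typing import List, Sequence, Union

Vector = List[float]


def masked_average_pool(
    feature_map: Sequence[Sequence[Sequence[float]]],
    mask: Sequence[Sequence[int]],
) -> Vector:
    """Average the feature vectors at positions where the mask is non-zero.

    feature_map is H x W x C (nested lists); mask is H x W with 0/1 entries.
    Hypothetical stand-in for a mask-aware extractor, not the paper's design.
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(feature_map, mask):
        for feat, keep in zip(feat_row, mask_row):
            if keep:
                count += 1
                for c in range(channels):
                    total[c] += feat[c]
    if count == 0:
        return total  # empty mask: fall back to a zero vector
    return [t / count for t in total]


def interleave(
    tokens: Sequence[str],
    region_features: Sequence[Vector],
    placeholder: str = "<region>",  # assumed placeholder name
) -> List[Union[str, Vector]]:
    """Replace each placeholder token with the next region feature vector."""
    if sum(tok == placeholder for tok in tokens) != len(region_features):
        raise ValueError("need exactly one feature vector per placeholder")
    features = iter(region_features)
    return [next(features) if tok == placeholder else tok for tok in tokens]


# Toy 2 x 2 feature map with 2 channels; the mask selects the diagonal.
feature_map = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
mask = [
    [1, 0],
    [0, 1],
]

pooled = masked_average_pool(feature_map, mask)
sequence = interleave(["Describe", "<region>", "in", "detail"], [pooled])
print(pooled)    # [4.0, 3.0]
print(sequence)  # ['Describe', [4.0, 3.0], 'in', 'detail']
```

In a real model the text tokens would be embedded and the pooled vector projected into the same embedding space before the sequence reaches the LLM; the mixed list here only shows where the region feature sits relative to the instruction text.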