LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation

Zhenshi Li, Dilxat Muhtar, Feng Gu, Xueliang Zhang, Pengfeng Xiao, Guangjun He, Xiaoxiang Zhu
2024-11-14
Abstract: Automatically and rapidly understanding Earth's surface is fundamental to our grasp of the living environment and informed decision-making. This underscores the need for a unified system with comprehensive capabilities in analyzing Earth's surface to address a wide range of human needs. The emergence of multimodal large language models (MLLMs) has great potential in boosting the efficiency and convenience of intelligent Earth observation. These models can engage in human-like conversations, serve as unified platforms for understanding images, follow diverse instructions, and provide insightful feedback. In this study, we introduce LHRS-Bot-Nova, an MLLM specialized in understanding remote sensing (RS) images, designed to expertly perform a wide range of RS understanding tasks aligned with human instructions. LHRS-Bot-Nova features an enhanced vision encoder and a novel bridge layer, enabling efficient visual compression and better language-vision alignment. To further enhance RS-oriented vision-language alignment, we propose a large-scale RS image-caption dataset, generated through feature-guided image recaptioning. Additionally, we introduce an instruction dataset specifically designed to improve spatial recognition abilities. Extensive experiments demonstrate superior performance of LHRS-Bot-Nova across various RS image understanding tasks. We also evaluate the performance of different MLLMs in complex RS perception and instruction following using a complicated multi-choice question evaluation benchmark, providing a reliable guide for future model selection and improvement. Data, code, and models will be available at https://github.com/NJU-LHRS/LHRS-Bot.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper targets three main problems:

1. **Lack of high-quality, large-scale image-caption datasets**: The datasets currently used for vision-language pre-training of remote-sensing multimodal large language models (MLLMs) are often noisy, information-poor, and semantically weak, with low sentence diversity and an excessive focus on salient objects. These flaws limit how effectively the model can align the two modalities.
2. **Weak spatial recognition and a tendency to hallucinate**: Existing remote-sensing MLLMs localize objects with low accuracy and are prone to hallucinated responses, that is, giving incorrect answers when faced with questions beyond their capabilities.
3. **Difficulty of evaluating MLLMs comprehensively**: Although these models perform well on common tasks such as classification, visual question answering, and visual localization, existing evaluation metrics do not fully reflect their abilities in complex scene understanding, object attribute recognition, spatial relationship recognition, and following human instructions.

To address these problems, the paper proposes LHRS-Bot-Nova, an improved multimodal large language model built specifically for understanding and interpreting remote-sensing images. It tackles them in the following ways:

- **Constructing a high-quality, large-scale image-caption dataset**: The paper proposes a feature-guided image recaptioning method and uses it to generate a new dataset, LHRS-Align-Recap, which has a richer vocabulary, more diverse sentence structures, and stronger vision-language alignment quality.
- **Enhancing spatial recognition and reducing hallucinations**: The paper expands the LHRS-Instruct dataset with more conversations about localization and perception, and adds an off-the-shelf robust visual instruction dataset containing a large number of negative samples to balance the data and make hallucinated responses less likely.
- **Optimizing the model architecture**: The paper designs an enhanced visual encoder and a new visual perceptron based on a mixture-of-experts (MoE) structure, which improve the model's vision-language alignment and visual understanding.
- **Systematic evaluation**: Beyond standard remote-sensing tasks, the paper evaluates LHRS-Bot-Nova on a multiple-choice question benchmark (LHRS-Bench) that measures instruction following and other remote-sensing-specific abilities along multiple dimensions.

Through these improvements, LHRS-Bot-Nova performs strongly across a range of remote-sensing image understanding tasks, with notable progress in spatial recognition and in reducing hallucinations.
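As a rough illustration of how a multiple-choice benchmark can report results per ability dimension, the sketch below scores model responses against reference option letters. It is not the LHRS-Bench implementation: the record fields (`dimension`, `answer`, `response`), the dimension labels, the four-option `A` to `D` format, and the rule that the response must begin with the option letter are all assumptions made for this example.

```python
import re
from collections import defaultdict


def extract_choice(response):
    """Return the option letter a response starts with, or None.

    Assumes (hypothetically) that the model was told to answer with the
    option letter first, e.g. "B", "B.", or "(B) farmland".
    """
    match = re.match(r"\W*([A-D])\b", response.strip())
    return match.group(1) if match else None


def score_by_dimension(records):
    """Compute accuracy per dimension.

    Responses with no parseable option letter count as wrong, so a model
    that ignores the answer format is penalized as well.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["dimension"]] += 1
        if extract_choice(rec["response"]) == rec["answer"]:
            correct[rec["dimension"]] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Made-up records for illustration only; not taken from any real benchmark.
example_records = [
    {"dimension": "spatial relationship", "answer": "B", "response": "B"},
    {"dimension": "spatial relationship", "answer": "C",
     "response": "(A) left of the road"},
    {"dimension": "object attribute", "answer": "D", "response": "D. red roof"},
    {"dimension": "object attribute", "answer": "A",
     "response": "I am not sure."},
]

if __name__ == "__main__":
    for dim, acc in score_by_dimension(example_records).items():
        print(f"{dim}: {acc:.2f}")
```

Reporting accuracy per dimension rather than as a single number makes it easier to see whether a model is weak specifically in, say, spatial relationships, which is the kind of breakdown the summary above describes.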