Abstract: Despite remarkable progress, existing multimodal large language models (MLLMs) are still inferior in granular visual recognition. Contrary to previous works, we study this problem from the perspective of image resolution, and reveal that a combination of low- and high-resolution visual features can effectively mitigate this shortcoming. Based on this observation, we propose a novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation (MRA). In particular, MRA adopts two visual pathways for images with different resolutions, where high-resolution visual information is embedded into the low-resolution pathway via the novel mixture-of-resolution adapters (MR-Adapters). This design also greatly reduces the input sequence length of MLLMs. To validate MRA, we apply it to a recent MLLM called LLaVA, and term the new model LLaVA-HR. We conduct extensive experiments on 11 vision-language (VL) tasks, which show that LLaVA-HR outperforms existing MLLMs on 8 VL tasks, e.g., +9.4% on TextVQA. More importantly, both training and inference of LLaVA-HR remain efficient with MRA, e.g., 20 training hours and 3$\times$ the inference speed of LLaVA-1.5. Source code is released at:

What problem does this paper attempt to address?

This paper addresses the weak performance of existing multimodal large language models (MLLMs) on fine-grained visual recognition. Although these models have made significant progress on many tasks, they still fall short on tasks that require fine-grained visual understanding, such as TextVQA. Specifically:

1. **Deficiencies in fine-grained visual recognition**: Existing MLLMs are prone to hallucinations when dealing with small or occluded objects, which limits their effectiveness in practical applications.
2. **Computational cost of high-resolution images**: Increasing the resolution of the input image can improve visual recognition performance, but it also increases computational complexity and can make training unstable.

To address these problems, the authors propose **Mixture-of-Resolution Adaptation (MRA)**. MRA uses a dual visual pathway design that processes low- and high-resolution images together, and it embeds high-resolution information into the low-resolution pathway through novel mixture-of-resolution adapters (MR-Adapters). This improves the model's visual recognition ability while keeping training and inference efficient.

### Specific contributions:

1. **Reveal the importance of image resolution for MLLMs** and propose an efficient Mixture-of-Resolution Adaptation (MRA) scheme that takes advantage of high-resolution images while remaining efficient.
2. **Design a novel mixture-of-resolution adapter (MR-Adapter)** that embeds high-resolution information into the low-resolution pathway, thereby enhancing the model's visual description ability.
3. **Propose a powerful MLLM, LLaVA-HR, based on MRA**, which outperforms existing MLLMs on 8 of 11 vision-language tasks and has a much lower training cost than most MLLMs.
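
The dual-pathway idea described above can be illustrated with a small sketch. This summary does not describe the internals of the MR-Adapter, so everything below is an assumption made for illustration: the function names (`avg_pool_to`, `mr_adapter_sketch`), the average pooling, the linear projection `proj`, and the `tanh` gate are hypothetical choices, not the authors' implementation. The sketch only shows the general pattern of folding a high-resolution feature map into a low-resolution one so that the number of visual tokens stays at the low-resolution count.

```python
import numpy as np


def avg_pool_to(feat, out_hw):
    """Average-pool a (H, W, C) feature map down to (out_hw, out_hw, C).

    Assumes H and W are integer multiples of out_hw.
    """
    h, w, c = feat.shape
    fh, fw = h // out_hw, w // out_hw
    # Split each spatial axis into (output index, within-window index),
    # then average over the two within-window axes.
    return feat.reshape(out_hw, fh, out_hw, fw, c).mean(axis=(1, 3))


def mr_adapter_sketch(low_feat, high_feat, proj, gate_w):
    """Hypothetical gated fusion: low + gate * proj(pooled high).

    low_feat:  (G, G, C_low)   features from the low-resolution pathway
    high_feat: (H, H, C_high)  features from the high-resolution pathway
    proj:      (C_high, C_low) assumed linear projection to match channels
    gate_w:    (2 * C_low, C_low) assumed weights for a channel-wise gate
    """
    grid = low_feat.shape[0]
    pooled = avg_pool_to(high_feat, grid)   # (G, G, C_high)
    projected = pooled @ proj               # (G, G, C_low)
    # Channel-wise gate computed from globally averaged features of both paths.
    summary = np.concatenate(
        [low_feat.mean(axis=(0, 1)), projected.mean(axis=(0, 1))]
    )
    gate = np.tanh(summary @ gate_w)        # (C_low,)
    # Residual form: the low-resolution grid (and token count) is preserved.
    return low_feat + gate * projected


# Toy usage with made-up sizes (not the paper's configuration).
rng = np.random.default_rng(0)
low = rng.standard_normal((24, 24, 8))
high = rng.standard_normal((48, 48, 16))
proj = rng.standard_normal((16, 8)) * 0.1
gate_w = rng.standard_normal((16, 8)) * 0.1

fused = mr_adapter_sketch(low, high, proj, gate_w)
tokens = fused.reshape(-1, fused.shape[-1])  # one token per low-res grid cell
```

In this toy setup the fused output keeps the 24 × 24 grid of the low-resolution pathway, so the language model would still receive 576 visual tokens even though a 48 × 48 high-resolution map contributed to them.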
### Experimental results:

Experiments show that LLaVA-HR performs well on multiple vision-language tasks, most notably TextVQA, where it improves performance by 9.4%. Training and inference of LLaVA-HR also remain efficient. For example, at the same resolution, its inference speed is three times that of LLaVA-1.5.

### Summary:

This paper addresses the deficiencies of existing MLLMs in fine-grained visual recognition by introducing Mixture-of-Resolution Adaptation (MRA), and it verifies the method's effectiveness and efficiency through experiments on 11 vision-language tasks.
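
To put the efficiency discussion in concrete terms, the following back-of-the-envelope sketch shows why feeding high-resolution images directly to an MLLM is costly. It assumes a ViT-style encoder that emits one token per non-overlapping patch. The patch size of 14 and the 336 and 1008 pixel resolutions are illustrative assumptions rather than figures taken from this summary, and `num_visual_tokens` is a hypothetical helper.

```python
def num_visual_tokens(image_size: int, patch_size: int = 14) -> int:
    """Tokens produced by a ViT-style encoder for a square image.

    Assumes one token per non-overlapping patch and no extra class token.
    """
    side = image_size // patch_size
    return side * side


low_res_tokens = num_visual_tokens(336)    # 24 * 24 patches
high_res_tokens = num_visual_tokens(1008)  # 72 * 72 patches

# Tripling the side length multiplies the token count by nine.
growth = high_res_tokens / low_res_tokens

print(low_res_tokens, high_res_tokens, growth)  # 576 5184 9.0
```

Because the token count grows with the square of the image side, a design that keeps the sequence at the low-resolution length while still using high-resolution features avoids most of this growth.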

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

InfMLLM: A Unified Framework for Visual-Language Tasks

Visual Perception by Large Language Model's Weights

Demonstrative Instruction Following in Multimodal LLMs Via Integrating Low-Rank Adaptation with Ensemble Learning

LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

Large Language Models Are Strong Audio-Visual Speech Recognition Learners

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Multiway-Adapter: Adapting Multimodal Large Language Models for Scalable Image-Text Retrieval

Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models