Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

Gen Luo,Yiyi Zhou,Yuxin Zhang,Xiawu Zheng,Xiaoshuai Sun,Rongrong Ji
2024-03-05
Abstract: Despite remarkable progress, existing multimodal large language models (MLLMs) are still inferior in granular visual recognition. Contrary to previous works, we study this problem from the perspective of image resolution, and reveal that a combination of low- and high-resolution visual features can effectively mitigate this shortcoming. Based on this observation, we propose a novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation (MRA). In particular, MRA adopts two visual pathways for images with different resolutions, where high-resolution visual information is embedded into the low-resolution pathway via the novel mixture-of-resolution adapters (MR-Adapters). This design also greatly reduces the input sequence length of MLLMs. To validate MRA, we apply it to a recent MLLM called LLaVA, and term the new model LLaVA-HR. We conduct extensive experiments on 11 vision-language (VL) tasks, which show that LLaVA-HR outperforms existing MLLMs on 8 VL tasks, e.g., +9.4% on TextVQA. More importantly, both training and inference of LLaVA-HR remain efficient with MRA, e.g., 20 training hours and 3$\times$ faster inference than LLaVA-1.5. Source codes are released at:
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to solve is the poor performance of existing multimodal large language models (MLLMs) on fine-grained visual recognition. Although these models have made significant progress on many tasks, they remain unsatisfactory on tasks that require fine-grained visual understanding, such as TextVQA. Specifically:

1. **Deficiencies in fine-grained visual recognition**: Existing MLLMs are prone to hallucinations when dealing with small or occluded objects, which limits their effectiveness in practical applications.
2. **Computational cost of high-resolution images**: Increasing the resolution of the input image can improve visual recognition, but it also raises computational cost and can make training unstable, especially at high resolutions.

To solve these problems, the authors propose a new method, **Mixture-of-Resolution Adaptation (MRA)**. MRA processes high- and low-resolution images at the same time through a dual visual pathway design, and embeds high-resolution information into the low-resolution pathway through a novel mixture-of-resolution adapter (MR-Adapter). This improves the model's visual recognition ability while keeping training and inference efficient.

### Specific contributions:

1. **Reveal the importance of image resolution for MLLMs** and propose an efficient Mixture-of-Resolution Adaptation (MRA) scheme that takes advantage of high-resolution images while remaining efficient.
2. **Design a novel mixture-of-resolution adapter (MR-Adapter)** that embeds high-resolution information into the low-resolution pathway, strengthening the visual representation.
3. **Propose a powerful MLLM, LLaVA-HR, based on MRA**, which outperforms existing MLLMs on 8 of 11 vision-language tasks and has a much lower training cost than most MLLMs.
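The dual-pathway idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the summary does not specify how the MR-Adapter aligns or mixes features, so the average pooling, the scalar `gate`, and the helper names `avg_pool`, `fuse`, and `flatten_tokens` are all assumptions made for illustration. The point it shows is that folding high-resolution features into the low-resolution grid leaves the token count at the low-resolution size.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [height][width][channels]


def avg_pool(feature: FeatureMap, factor: int) -> FeatureMap:
    """Average non-overlapping factor x factor windows of a feature map."""
    height, width, channels = len(feature), len(feature[0]), len(feature[0][0])
    if height % factor or width % factor:
        raise ValueError("grid size must be divisible by the pooling factor")
    pooled = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            window = [
                feature[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            ]
            row.append([
                sum(cell[c] for cell in window) / len(window)
                for c in range(channels)
            ])
        pooled.append(row)
    return pooled


def fuse(low: FeatureMap, high: FeatureMap, gate: float) -> FeatureMap:
    """Add gated, pooled high-resolution features onto the low-resolution grid.

    Assumption: both maps share the same channel width, and the high-resolution
    grid is an integer multiple of the low-resolution grid.
    """
    factor = len(high) // len(low)
    if factor < 1 or len(high) != len(low) * factor:
        raise ValueError("high-resolution grid must be a multiple of the low one")
    aligned = avg_pool(high, factor)
    return [
        [
            [l + gate * h for l, h in zip(low_cell, high_cell)]
            for low_cell, high_cell in zip(low_row, high_row)
        ]
        for low_row, high_row in zip(low, aligned)
    ]


def flatten_tokens(feature: FeatureMap) -> List[List[float]]:
    """Flatten a grid into the token sequence a language model would receive."""
    return [cell for row in feature for cell in row]


# Toy inputs: a 2x2 low-resolution grid and a 4x4 high-resolution grid, 1 channel.
low = [[[1.0], [1.0]], [[1.0], [1.0]]]
high = [[[float(y * 4 + x)] for x in range(4)] for y in range(4)]

fused = fuse(low, high, gate=0.5)
tokens = flatten_tokens(fused)
print(len(tokens), "tokens after fusion;", len(high) * len(high[0]), "if the high-res grid were fed directly")
```

In this toy setting the fused sequence has 4 tokens, while passing the high-resolution grid directly would produce 16. A learned adapter would replace the fixed pooling and scalar gate with trainable layers.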
### Experimental results:

Experiments on 11 vision-language tasks show that LLaVA-HR outperforms existing MLLMs on 8 of them, for example by 9.4% on TextVQA. Training and inference also remain efficient: the abstract reports 20 training hours, and at the same resolution inference is three times as fast as LLaVA-1.5.

### Summary:

This paper addresses the weakness of existing MLLMs in fine-grained visual recognition by introducing Mixture-of-Resolution Adaptation (MRA), and verifies its effectiveness and efficiency through a series of experiments.