MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang
2024-06-27
Abstract: Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, the majority of these models are constrained to process low-resolution images, which limits their effectiveness in perception tasks that necessitate detailed visual information. In our study, we present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capabilities by incorporating a multi-granularity vision flow, which includes low-resolution, high-resolution, and object-centric features. We propose the integration of an additional high-resolution visual encoder to capture fine-grained details, which are then fused with base visual features through a Conv-Gate fusion network. To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Being trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide variety of language encoders, ranging from 3.8B to 34B, to evaluate the model's performance comprehensively. Extensive evaluations across multiple benchmarks demonstrate that MG-LLaVA outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code will be available at https://github.com/PhoenixZ810/MG-LLaVA.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to improve the performance of multimodal large language models (MLLMs) on visual understanding tasks, particularly those that depend on high-resolution images and fine-grained detail. Most existing models process only low-resolution images, which limits their effectiveness on tasks that require detailed visual information. The paper proposes MG-LLaVA, a model built around a multi-granularity visual flow that combines low-resolution, high-resolution, and object-centric features. An additional high-resolution visual encoder captures fine details, and a Conv-Gate fusion network fuses them with the base visual features. In addition, bounding boxes identified by offline detectors are used to extract object-level features that strengthen the model's object recognition capability. MG-LLaVA is instantiated with language encoders ranging from 3.8B to 34B parameters and evaluated on multiple image and video benchmarks, where it outperforms existing models of comparable size, especially on tasks involving object recognition. The experimental results support the MG-LLaVA design, showing a clear gain in visual perception and understanding, and the model is reported as surpassing models such as GPT-4V and GeminiPro-V.
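The two mechanisms described above, gated fusion of base and high-resolution features and object-level features pooled from detector boxes, can be illustrated with a small sketch. It is not the paper's implementation: the function names `gated_fusion` and `box_pool`, the scalar per-token gate, the residual form `low + gate * high`, and plain average pooling are all assumptions chosen for brevity. The layers of the actual Conv-Gate fusion network are not specified in this summary.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(low, high, gate_weights, gate_bias=0.0):
    """Fuse base (low-res) and high-res token features with a per-token gate.

    low, high: lists of tokens; each token is a list of floats of equal length.
    gate_weights: 2 * dim floats applied to concat(low_token, high_token).
    Assumed (hypothetical) form: fused = low + sigmoid(w . [low; high] + b) * high
    """
    fused = []
    for lo, hi in zip(low, high):
        concat = lo + hi  # list concatenation: [low features; high features]
        if len(gate_weights) != len(concat):
            raise ValueError("gate_weights must have length 2 * dim")
        gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
        # Residual update: the gate scales how much high-res detail is added.
        fused.append([l + gate * h for l, h in zip(lo, hi)])
    return fused


def box_pool(feature_map, box):
    """Average-pool an H x W x C feature map inside one bounding box.

    box: (x0, y0, x1, y1) in feature-map cells, end-exclusive.
    Returns one C-dimensional object-level feature vector.
    """
    x0, y0, x1, y1 = box
    cells = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    if not cells:
        raise ValueError("box covers no feature-map cells")
    channels = len(cells[0])
    return [sum(c[k] for c in cells) / len(cells) for k in range(channels)]


# Toy inputs: one token with two channels, and a 2 x 2 map with one channel.
low_tokens = [[1.0, 2.0]]
high_tokens = [[2.0, 4.0]]
fused_tokens = gated_fusion(low_tokens, high_tokens, gate_weights=[0.0] * 4)

toy_map = [[[1.0], [2.0]], [[3.0], [4.0]]]
object_feature = box_pool(toy_map, (0, 0, 2, 1))  # top row only
```

With all-zero gate weights and bias the gate evaluates to 0.5, so half of each high-resolution feature is added to the base feature. In a trained model the gate parameters would be learned, letting the network decide per token how much fine-grained detail to admit.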