MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang

2024-06-27

Abstract: Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, the majority of these models are constrained to process low-resolution images, which limits their effectiveness in perception tasks that necessitate detailed visual information. In our study, we present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capabilities by incorporating a multi-granularity vision flow, which includes low-resolution, high-resolution, and object-centric features. We propose the integration of an additional high-resolution visual encoder to capture fine-grained details, which are then fused with base visual features through a Conv-Gate fusion network. To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Being trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide variety of language encoders, ranging from 3.8B to 34B, to evaluate the model's performance comprehensively. Extensive evaluations across multiple benchmarks demonstrate that MG-LLaVA outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code will be available at https://github.com/PhoenixZ810/MG-LLaVA.
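To make the idea of gated fusion concrete, the following is a minimal sketch and not the paper's Conv-Gate network. It assumes the high-resolution features have already been resized to the same token width as the base features, replaces the convolutional layers with a per-token linear gate, and uses plain Python lists in place of tensors. The names gated_fuse, gate_weights, and gate_bias are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(low, high, gate_weights, gate_bias):
    """Fuse one base (low-res) token with one high-res token of equal width.

    Assumed form, for illustration only:
        gate_i  = sigmoid(W[i] . concat(low, high) + b[i])
        fused_i = low_i + gate_i * high_i
    """
    if len(low) != len(high):
        raise ValueError("low and high features must have the same width")
    concat = list(low) + list(high)
    fused = []
    for i, (lo, hi) in enumerate(zip(low, high)):
        logit = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        fused.append(lo + sigmoid(logit) * hi)
    return fused


# Toy example: zero weights and bias give a gate of exactly 0.5 per channel.
low = [1.0, 2.0]
high = [0.5, -1.0]
zero_w = [[0.0] * 4 for _ in range(2)]
fused = gated_fuse(low, high, zero_w, [0.0, 0.0])
print(fused)  # [1.25, 1.5]
```

With a strongly negative bias the gate closes and the output falls back to the base features, which is the property that lets a gate decide how much high-resolution detail to admit.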

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper focuses on improving how Multimodal Large Language Models (MLLMs) handle visual understanding tasks, especially those involving high-resolution images and fine detail. Most existing models can only process low-resolution images, which limits their effectiveness in tasks that require detailed visual information. The paper proposes a new model, MG-LLaVA, which introduces a multi-granularity visual flow combining low-resolution, high-resolution, and object-centric features. An additional high-resolution visual encoder captures fine details, which are fused with the base visual features through a Conv-Gate fusion network. In addition, bounding boxes identified by offline detectors are used to extract object-level features that strengthen the model's object recognition capability. MG-LLaVA is instantiated with language encoders ranging from 3.8B to 34B parameters and evaluated on multiple image and video benchmarks, where it performs better than existing models of comparable size, especially on tasks involving object recognition. The experimental results support the MG-LLaVA design, indicating a significant improvement in visual perception and understanding and surpassing models such as GPT-4V and GeminiPro-V. In short, the paper addresses how to improve multimodal language models on complex visual inputs, particularly high-resolution images and object recognition.
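The object-level branch can be illustrated with a short sketch: given a grid of visual features and a bounding box from a detector, average the features of the cells the box covers to obtain one vector per object. This is a simplification under stated assumptions, not the paper's implementation: it uses plain average pooling over whole grid cells rather than an interpolating operator such as RoI Align, boxes are assumed to be normalized to [0, 1], and pool_box_feature is a hypothetical name.

```python
import math


def pool_box_feature(feature_map, box):
    """Average the feature vectors of the grid cells covered by a box.

    feature_map: H x W grid of C-dimensional vectors (nested lists).
    box: (x0, y0, x1, y1), normalized to [0, 1] (an assumption of this sketch).
    """
    height, width = len(feature_map), len(feature_map[0])
    x0, y0, x1, y1 = box
    col_start = min(int(x0 * width), width - 1)
    row_start = min(int(y0 * height), height - 1)
    # Always cover at least one cell, and never run past the grid.
    col_end = max(col_start + 1, min(math.ceil(x1 * width), width))
    row_end = max(row_start + 1, min(math.ceil(y1 * height), height))
    cells = [
        feature_map[r][c]
        for r in range(row_start, row_end)
        for c in range(col_start, col_end)
    ]
    channels = len(cells[0])
    return [sum(cell[k] for cell in cells) / len(cells) for k in range(channels)]


# Toy 2 x 2 feature map with one channel per cell.
grid = [[[1.0], [2.0]],
        [[3.0], [4.0]]]
top_left = pool_box_feature(grid, (0.0, 0.0, 0.5, 0.5))
right_half = pool_box_feature(grid, (0.5, 0.0, 1.0, 1.0))
whole = pool_box_feature(grid, (0.0, 0.0, 1.0, 1.0))
print(top_left, right_half, whole)  # [1.0] [3.0] [2.5]
```

Each detected object then contributes a single pooled vector, which can be passed to the language model alongside the image tokens.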

AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

InfMLLM: A Unified Framework for Visual-Language Tasks

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

Enhancing Visual Grounding and Generalization: A Multi-Task Cycle Training Approach for Vision-Language Models

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Valley: Video Assistant with Large Language model Enhanced abilitY

MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding