Abstract: Recent advancements in multi-modal large language models (MLLMs) have led to substantial improvements in visual understanding, primarily driven by sophisticated modality alignment strategies. However, predominant approaches prioritize global or regional comprehension, with less focus on fine-grained, pixel-level tasks. To address this gap, we introduce u-LLaVA, an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs. We commence by leveraging an efficient modality alignment approach, harnessing both image and video datasets to bolster the model's foundational understanding across diverse visual contexts. Subsequently, a joint instruction tuning method with task-specific projectors and decoders for end-to-end downstream training is presented. Furthermore, this work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs. The overall framework is simple, effective, and achieves state-of-the-art performance across multiple benchmarks. We also make our model, data, and code publicly accessible at https://github.com/OPPOMKLab/u-LLaVA.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses the limited ability of multi-modal large language models (MLLMs) to handle fine-grained, pixel-level tasks. Existing MLLMs have made significant progress in global and regional understanding, but they perform poorly on tasks that require fine-grained visual understanding. Specifically, these models have the following limitations:

1. **Fine-grained Visual Understanding**: Existing MLLMs mainly focus on global or regional understanding, with limited support for pixel-level tasks such as segmentation and object detection.
2. **High Data Demand**: Achieving regional understanding typically requires a large amount of training data, which increases training costs.
3. **Complex Module Design**: Achieving pixel-level understanding requires introducing or designing dedicated segmentation modules, which adds to the model's complexity.

To address these issues, the paper proposes u-LLaVA, a unified multi-task framework that enhances the perceptual capabilities of MLLMs by integrating pixel, regional, and global features. The main contributions of u-LLaVA include:

- **Efficient Modality Alignment Method**: Strengthens the model's foundational understanding using both image and video data.
- **Joint Instruction Tuning**: Introduces task-specific projectors and decoders in the same training stage to achieve multi-level understanding.
- **Public Dataset**: Releases a mask-based multi-task dataset of 277K samples for challenging and evaluating the fine-grained perceptual capabilities of MLLMs.

With these contributions, u-LLaVA achieves state-of-the-art performance across multiple benchmarks, and the model, data, and code are all publicly available.
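
The idea of task-specific projectors can be illustrated with a small, dependency-free sketch. It is not the paper's implementation: the token names `[SEG]` and `[LOC]`, the helper functions, the toy weights, and the two-dimensional hidden states are all hypothetical, chosen only to show how hidden states at special task tokens might be routed to separate projectors before reaching their decoders.

```python
from typing import Callable, Dict, List

Vector = List[float]
Projector = Callable[[Vector], Vector]


def make_linear_projector(weights: List[Vector]) -> Projector:
    """Return a function that computes weights @ x on plain Python lists."""
    def project(x: Vector) -> Vector:
        return [sum(w * v for w, v in zip(row, x)) for row in weights]
    return project


def route_task_tokens(
    tokens: List[str],
    hidden_states: List[Vector],
    projectors: Dict[str, Projector],
) -> Dict[str, List[Vector]]:
    """Project the hidden state of every token that has a task projector."""
    routed: Dict[str, List[Vector]] = {name: [] for name in projectors}
    for token, state in zip(tokens, hidden_states):
        if token in projectors:
            routed[token].append(projectors[token](state))
    return routed


# Hypothetical example: a four-token sequence with 2-d hidden states.
tokens = ["the", "[SEG]", "cat", "[LOC]"]
hidden_states = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0], [3.0, 1.0]]

# Assumed projectors: one maps to a 3-d segmentation query,
# the other to a 1-d localization feature.
projectors = {
    "[SEG]": make_linear_projector([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    "[LOC]": make_linear_projector([[2.0, 0.0]]),
}

routed = route_task_tokens(tokens, hidden_states, projectors)
print(routed)
```

In a real system each projected vector would be passed to the matching decoder (for example a mask decoder for segmentation), and the projectors would be trained jointly with the language model rather than fixed by hand.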

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

InfMLLM: A Unified Framework for Visual-Language Tasks

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

Audio-Visual LLM for Video Understanding

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization