u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Fanyi Wang, Yanchun Xie, Yi-Jie Huang, Yaqian Li
2024-08-28
Abstract: Recent advancements in multi-modal large language models (MLLMs) have led to substantial improvements in visual understanding, primarily driven by sophisticated modality alignment strategies. However, predominant approaches prioritize global or regional comprehension, with less focus on fine-grained, pixel-level tasks. To address this gap, we introduce u-LLaVA, an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs. We commence by leveraging an efficient modality alignment approach, harnessing both image and video datasets to bolster the model's foundational understanding across diverse visual contexts. Subsequently, a joint instruction tuning method with task-specific projectors and decoders for end-to-end downstream training is presented. Furthermore, this work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs. The overall framework is simple, effective, and achieves state-of-the-art performance across multiple benchmarks. We also make our model, data, and code publicly accessible at https://github.com/OPPOMKLab/u-LLaVA.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses the weak performance of multi-modal large language models (MLLMs) on fine-grained, pixel-level tasks. Existing MLLMs have made significant progress in global and regional understanding, but they perform poorly on tasks that require fine-grained visual understanding. Specifically, these models have the following limitations:

1. **Fine-grained Visual Understanding**: Existing MLLMs mainly focus on global or regional understanding, with limited support for fine-grained tasks such as segmentation and object detection.
2. **High Data Demand**: Achieving regional understanding typically requires a large amount of training data, which increases training costs.
3. **Complex Module Design**: Achieving pixel-level understanding requires introducing or designing dedicated segmentation modules, which adds to the model's complexity.

To address these issues, the paper proposes u-LLaVA, a unified multi-task framework that enhances the perceptual capabilities of MLLMs by integrating pixel, regional, and global features. The main contributions of u-LLaVA include:

- **Efficient Modality Alignment Method**: Strengthens the model's foundational understanding by aligning modalities on both image and video data.
- **Joint Instruction Tuning**: Trains task-specific projectors and decoders together in a single end-to-end stage to achieve pixel-, regional-, and global-level understanding.
- **Public Dataset**: Releases a mask-based multi-task dataset of 277K samples for challenging and evaluating the fine-grained perceptual capabilities of MLLMs.

With these improvements, the paper reports state-of-the-art performance on multiple benchmarks, and the model, data, and code are all publicly available.
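
The idea of task-specific projectors feeding task-specific decoders can be illustrated with a toy sketch. This is not the authors' implementation: the task names, dimensions, random linear maps, and the `route` helper are all hypothetical, chosen only to show one shared hidden state being projected into a different decoder input space per task.

```python
import random

# Hypothetical toy sketch: names, dimensions, and the random linear
# projectors are assumptions for illustration, not the u-LLaVA code.

def make_linear(in_dim, out_dim, seed):
    """Return a random linear map as out_dim rows of in_dim weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
            for _ in range(out_dim)]

def apply_linear(weights, vector):
    """Multiply a weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

HIDDEN_DIM = 8  # assumed size of the language model's hidden state

# One projector per task maps the shared hidden state into the input
# space of that task's decoder (output sizes are assumed).
PROJECTORS = {
    "segmentation": make_linear(HIDDEN_DIM, 4, seed=0),  # pixel-level decoder input
    "region": make_linear(HIDDEN_DIM, 2, seed=1),        # region-level decoder input
}

def route(task, hidden_state):
    """Project the shared hidden state with the projector for `task`."""
    if task not in PROJECTORS:
        raise KeyError(f"no projector registered for task {task!r}")
    return apply_linear(PROJECTORS[task], hidden_state)

hidden = [0.1 * i for i in range(HIDDEN_DIM)]
seg_query = route("segmentation", hidden)  # vector of length 4
region_query = route("region", hidden)     # vector of length 2
```

In a joint training setup, each projector and its decoder would receive gradients from its own task loss while sharing the same language model. The sketch shows only the routing step, not the training.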