Abstract: Significant advancements have recently been achieved in the field of multi-modal large language models (MLLMs), demonstrating their remarkable capabilities in understanding and reasoning across diverse tasks. However, these models are often trained for specific tasks and rely on task-specific input-output formats, limiting their applicability to a broader range of tasks. This raises a fundamental question: Can we develop a unified approach to represent and handle different multi-modal tasks to maximize the generalizability of MLLMs? In this paper, we propose UnifiedMLLM, a comprehensive model designed to represent various tasks using a unified representation. Our model exhibits strong capabilities in comprehending the implicit intent of user instructions and performing reasoning. In addition to generating textual responses, our model also outputs task tokens and grounding tokens, serving as indicators of task types and task granularity. These outputs are subsequently routed through the task router and directed to specific expert models for task completion. To train our model, we construct a task-specific dataset and a 100k multi-task dataset encompassing complex scenarios. Employing a three-stage training strategy, we equip our model with robust reasoning and task processing capabilities while preserving its generalization capacity and knowledge reservoir. Extensive experiments showcase the impressive performance of our unified representation approach across various tasks, surpassing existing methodologies. Furthermore, our approach exhibits exceptional scalability and generality. Our code, model, and dataset will be available at https://github.com/lzw-lzw/UnifiedMLLM.

What problem does this paper attempt to address?

The paper addresses a limitation of Multimodal Large Language Models (MLLMs): although they perform well on individual tasks, they are usually trained for particular tasks and depend on task-specific input-output formats, which restricts how broadly they can be applied. The paper introduces UnifiedMLLM, a model that represents and processes a variety of multimodal tasks in a unified way by introducing task tokens and grounding tokens.

Specifically, UnifiedMLLM interprets the implicit intent behind a user instruction and outputs a text response together with special tokens that indicate the task type and the specific region to be processed. A task router then passes these outputs to specialized expert models, which carry out the task.

To train UnifiedMLLM, the authors constructed a task-specific dataset and a 100k multi-task dataset covering complex scenarios. Training follows a three-stage strategy: first, the model learns to understand multimodal inputs; then it is trained on the task-specific dataset so that it can understand human intent, perform reasoning, and complete various tasks; finally, it is further optimized on the multi-round multi-task dataset to strengthen its understanding and reasoning capabilities.

The main contributions of the paper include:
- Proposing a unified task representation that uses task tokens and grounding tokens to represent different tasks and regions, so that multiple tasks can be integrated seamlessly.
- Constructing a task-specific dataset and a multi-task dataset for complex scenarios, and proposing a three-stage training strategy that progressively improves the model's understanding and reasoning capabilities while retaining its existing knowledge and abilities.
- Reporting experimental results in which the unified approach performs strongly across various benchmarks, indicating that the model handles multiple tasks well and generalizes across domains.
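The routing step described above can be illustrated with a small sketch. This is not the authors' implementation: the token syntax (`<task>...</task>` and `<region>...</region>`), the `TaskRouter` class, and the stub experts are all hypothetical names chosen for illustration. The sketch only shows the general idea of parsing task and grounding tokens out of a model's output and dispatching to a matching expert.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical token syntax, assumed for illustration only:
#   <task>NAME</task>      marks the task type
#   <region>TEXT</region>  marks the region the task applies to
TASK_RE = re.compile(r"<task>(.*?)</task>")
REGION_RE = re.compile(r"<region>(.*?)</region>")


@dataclass
class ParsedOutput:
    text: str             # plain textual response with special tokens removed
    task: Optional[str]   # task type, or None if no task token was emitted
    regions: List[str]    # grounding targets, possibly empty


def parse_output(raw: str) -> ParsedOutput:
    """Split a raw model output into text, task type, and regions."""
    task_match = TASK_RE.search(raw)
    regions = REGION_RE.findall(raw)
    # Remove both kinds of special tokens, then normalize whitespace.
    text = REGION_RE.sub("", TASK_RE.sub("", raw))
    text = " ".join(text.split())
    task = task_match.group(1) if task_match else None
    return ParsedOutput(text=text, task=task, regions=regions)


Expert = Callable[[ParsedOutput], str]


class TaskRouter:
    """Dispatches parsed outputs to registered expert callables."""

    def __init__(self) -> None:
        self._experts: Dict[str, Expert] = {}

    def register(self, task: str, expert: Expert) -> None:
        self._experts[task] = expert

    def route(self, raw: str) -> str:
        parsed = parse_output(raw)
        if parsed.task is None:
            # No task token: the textual response is the final answer.
            return parsed.text
        expert = self._experts.get(parsed.task)
        if expert is None:
            raise KeyError(f"no expert registered for task {parsed.task!r}")
        return expert(parsed)


if __name__ == "__main__":
    router = TaskRouter()
    # Stub experts standing in for real editing or segmentation models.
    router.register("edit", lambda p: "edit:" + ",".join(p.regions))
    router.register("segment", lambda p: "segment:" + ",".join(p.regions))
    print(router.route("Sure. <task>edit</task> <region>the dog</region>"))
    print(router.route("This image shows a dog on a beach."))
```

Keeping the experts behind a registry means a new task can be supported by registering another callable, without changing the parsing or dispatch logic. That matches the scalability the summary attributes to the unified representation, though the actual mechanism in UnifiedMLLM may differ.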

UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

InfMLLM: A Unified Framework for Visual-Language Tasks

Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Unified Generative and Discriminative Training for Multi-modal Large Language Models

One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning

UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models

MM-LLMs: Recent Advances in MultiModal Large Language Models

Uni3DL: Unified Model for 3D and Language Understanding

Unified Language Model Pre-training for Natural Language Understanding and Generation

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

Model Composition for Multimodal Large Language Models

Efficient Multimodal Large Language Models: A Survey

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

Enhancing Subtask Performance of Multi-modal Large Language Model

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning