OneLLM: One Framework to Align All Modalities with Language

Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue

2023-12-07

Abstract: Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning, Multimedia

What problem does this paper attempt to address?

The paper attempts to address the issue that existing Multimodal Large Language Models (MLLMs) rely on modality-specific encoders, which are often architecturally different and limited to common modalities. The paper proposes a new framework called OneLLM, which aims to overcome the limitations of existing models by aligning eight modalities with language through a unified framework. Specifically, the main contributions of the paper include:

1. **Proposing a unified multimodal alignment framework**: Unlike modality-specific encoders in existing works, OneLLM uses a unified multimodal encoder and a Universal Projection Module (UPM), which can serve as a general and scalable component to handle various modalities.
2. **Integrating eight different modalities into a single model for the first time**: Through a unified framework and a step-by-step multimodal alignment pipeline, OneLLM can easily extend to more data modalities.
3. **Constructing a large-scale multimodal instruction dataset**: After fine-tuning on this dataset, OneLLM performs excellently on multimodal tasks, surpassing specialized models and existing MLLMs.

Through these innovations, the paper aims to enhance the generality and scalability of multimodal large language models, enabling them to effectively understand and reason across a wider variety of modality data.
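The idea of a universal projection module built by "mixing multiple image projection modules and dynamic routing" can be illustrated with a small sketch. This is not the authors' implementation: the class name `SoftMixtureProjector`, the use of plain linear maps as experts, the per-token softmax router, and all dimensions are assumptions chosen only to show how a soft mixture of projection experts might combine their outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; stands in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class SoftMixtureProjector:
    """Hypothetical stand-in for a UPM: several projection experts
    whose outputs are blended by a soft (softmax) router.

    Assumption: each expert is a single linear map. The experts in
    the paper are projection modules, simplified here for brevity.
    """

    def __init__(self, in_dim, out_dim, num_experts, seed=0):
        rng = random.Random(seed)
        self.experts = [
            random_matrix(out_dim, in_dim, rng) for _ in range(num_experts)
        ]
        # The router maps a token to one score per expert.
        self.router = random_matrix(num_experts, in_dim, rng)

    def route(self, token):
        """Return routing weights that are positive and sum to 1."""
        return softmax(matvec(self.router, token))

    def project(self, token):
        """Weighted sum of every expert's projection of the token."""
        weights = self.route(token)
        outputs = [matvec(expert, token) for expert in self.experts]
        out_dim = len(outputs[0])
        return [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(out_dim)
        ]


projector = SoftMixtureProjector(in_dim=4, out_dim=3, num_experts=3)
token = [0.5, -1.0, 0.25, 2.0]  # illustrative encoder feature
weights = projector.route(token)
projected = projector.project(token)
```

Because the routing weights depend on the input token, features from different modalities can lean on different experts while sharing one set of projection parameters, which is the property the summary describes as general and scalable.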

ModaVerse: Efficiently Transforming Modalities with LLMs

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

InfMLLM: A Unified Framework for Visual-Language Tasks

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

DreamLLM: Synergistic Multimodal Comprehension and Creation

UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

LLMs Can Evolve Continually on Modality for X-Modal Reasoning

Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

NoteLLM-2: Multimodal Large Representation Models for Recommendation

From Unimodal to Multimodal: Scaling up Projectors to Align Modalities

NVLM: Open Frontier-Class Multimodal LLMs

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs