OneLLM: One Framework to Align All Modalities with Language

Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue
2023-12-07
Abstract: Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning, Multimedia
What problem does this paper attempt to address?
The paper addresses a limitation of existing Multimodal Large Language Models (MLLMs): they rely on modality-specific encoders, which usually differ in architecture and cover only common modalities. It proposes OneLLM, a framework that aligns eight modalities with language through a single unified design. The main contributions are:

1. **A unified multimodal alignment framework**: Instead of the modality-specific encoders used in existing works, OneLLM uses a unified multimodal encoder and a Universal Projection Module (UPM), which serves as a general and scalable component for handling different modalities.
2. **Eight modalities in a single model**: The paper presents OneLLM as the first model to integrate eight different modalities. Its unified framework and progressive multimodal alignment pipeline are designed so that further modalities can be added easily.
3. **A large-scale multimodal instruction dataset**: After fine-tuning on this dataset, the paper reports that OneLLM performs strongly on multimodal tasks, surpassing specialized models and existing MLLMs.

Together, these contributions aim to make multimodal large language models more general and scalable, so that they can understand and reason over a wider range of modalities.
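The idea behind the UPM, "mixing multiple image projection modules and dynamic routing", can be illustrated with a toy sketch. This is not the authors' implementation: the class name `UniversalProjectionSketch`, the use of plain linear experts, the mean-pooled router input, and the softmax (soft) weighting are all assumptions made for illustration. The summary above does not specify these details.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng):
    """Small random matrix used as a stand-in for trained weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class UniversalProjectionSketch:
    """Toy soft mixture of K linear projection experts (hypothetical design)."""

    def __init__(self, in_dim, out_dim, num_experts=3, seed=0):
        rng = random.Random(seed)
        # Assumption: each expert is a single linear map. A real system
        # would likely use deeper projection modules.
        self.experts = [rand_matrix(out_dim, in_dim, rng) for _ in range(num_experts)]
        # Assumption: the router is a linear layer over mean-pooled tokens.
        self.router = rand_matrix(num_experts, in_dim, rng)

    def route(self, tokens):
        """Return one routing weight per expert for this input."""
        n = len(tokens)
        pooled = [sum(col) / n for col in zip(*tokens)]
        return softmax(matvec(self.router, pooled))

    def project(self, tokens):
        """Project every token with a routing-weighted sum of expert outputs."""
        weights = self.route(tokens)
        projected = []
        for x in tokens:
            per_expert = [matvec(W, x) for W in self.experts]
            out_dim = len(per_expert[0])
            projected.append(
                [sum(w * y[i] for w, y in zip(weights, per_expert)) for i in range(out_dim)]
            )
        return projected, weights


# Two 4-dimensional encoder tokens projected into a 2-dimensional space.
upm = UniversalProjectionSketch(in_dim=4, out_dim=2)
tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2]]
projected, weights = upm.project(tokens)
```

In this sketch the routing weights depend on the input, so different inputs can blend the same set of experts differently. That is one plausible reading of how a single projection module could serve several modalities.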