Abstract: In this work, we explore a scalable way for building a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows for the easy extension of new modalities by adding adapters and FFNs, while also enabling multi-modal fusion through self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic space of different modalities and capture fine-grained details within modalities concurrently. With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.

What problem does this paper attempt to address?

The problem this paper addresses is how to build a general representation model that can be extended to an unlimited number of modalities. Specifically, the researchers propose a model named ONE-PEACE, which aims to seamlessly align and integrate representations across the vision, audio, and language modalities through a highly extensible design. The design is guided by three requirements:

1. **Flexible architecture**: The architecture of ONE-PEACE must be flexible enough to adapt to various modalities and to support multi-modal interaction.
2. **Modality-agnostic pretraining tasks**: The pretraining tasks must not only extract information within each modality but also ensure alignment across modalities.
3. **General and simple pretraining tasks**: The tasks should be general and simple enough to apply to different modalities.

To meet these requirements, ONE-PEACE adopts an architecture that consists of multiple modality adapters and a modality-fusion encoder. Each modality has an adapter that converts the raw input into a feature sequence. The modality-fusion encoder is based on the Transformer architecture, and each Transformer block contains a shared self-attention layer and multiple modality-specific feed-forward networks (FFNs). With this design, a new modality can be added simply by plugging in the corresponding adapter and FFNs.

For the pretraining stage, ONE-PEACE uses two modality-agnostic tasks:

- **Cross-modal contrastive learning**: This covers vision-language contrastive learning and audio-language contrastive learning, which align the semantic spaces of the vision, audio, and language modalities.
- **Intra-modal denoising contrastive learning**: This combines masked prediction with contrastive learning. It computes a contrastive loss between fine-grained masked features and visible features, which improves the model's fine-tuning performance on downstream tasks.

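
The cross-modal contrastive objective described above can be illustrated with a minimal NumPy sketch of a symmetric InfoNCE-style loss over a batch of paired features. This is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed `temperature=0.07`, and the use of a single pooled feature vector per sample are all hypothetical choices made for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def cross_modal_contrastive_loss(feats_a, feats_b, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired features.

    feats_a, feats_b: arrays of shape (batch, dim); row i of each is a matched
    pair (e.g. an image and its caption). The temperature value is an
    assumption for this sketch.
    """
    a = l2_normalize(feats_a)
    b = l2_normalize(feats_b)
    logits = a @ b.T / temperature              # (batch, batch) similarity matrix
    idx = np.arange(len(a))                     # matched pairs sit on the diagonal
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)

# Usage: well-aligned pairs should score a lower loss than mismatched ones.
rng = np.random.default_rng(0)
vision = rng.normal(size=(8, 16))
text_aligned = vision + 0.01 * rng.normal(size=vision.shape)
text_shuffled = np.roll(vision, 1, axis=0)      # every pair is now mismatched
loss_aligned = cross_modal_contrastive_loss(vision, text_aligned)
loss_shuffled = cross_modal_contrastive_loss(vision, text_shuffled)
```

In this sketch the same function would serve both the vision-language and the audio-language case, since it only sees two batches of feature vectors and makes no assumption about which modality produced them.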
Through these designs, ONE-PEACE not only achieves leading results on uni-modal tasks (such as image classification and semantic segmentation) and multi-modal tasks (such as audio-text retrieval, audio classification, audio question answering, image-text retrieval, and visual grounding), but also demonstrates its potential to be extended to unlimited modalities.
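
To make the block structure described above more concrete, here is a small NumPy sketch of one Transformer block with a self-attention layer shared by all modalities and a separate FFN per modality. It is a simplified illustration, not the released model: the class name `SharedAttentionBlock`, the single attention head, the ReLU activation, the parameter-free layer norm, and the random initialization are assumptions made for the example. The inputs are assumed to be feature sequences already produced by each modality's adapter.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Parameter-free layer norm, kept minimal for the sketch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

class SharedAttentionBlock:
    """One block: attention weights shared by all modalities, one FFN per modality."""

    def __init__(self, dim, modalities, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        # Shared single-head attention projections (an assumption; real models are multi-head).
        self.w_q, self.w_k, self.w_v, self.w_o = (self._init(dim, dim) for _ in range(4))
        self.ffn = {}
        for name in modalities:
            self.add_modality(name)

    def _init(self, rows, cols):
        return self.rng.normal(scale=rows ** -0.5, size=(rows, cols))

    def add_modality(self, name, hidden_mult=4):
        """Extend the block with a new modality by adding only a new FFN."""
        hidden = hidden_mult * self.dim
        self.ffn[name] = (self._init(self.dim, hidden), self._init(hidden, self.dim))

    def __call__(self, segments):
        """segments: dict mapping modality name -> array of shape (seq_len, dim)."""
        names = list(segments)
        # Concatenate all modalities so self-attention can fuse them.
        x = np.concatenate([segments[n] for n in names], axis=0)
        h = layer_norm(x)
        scores = (h @ self.w_q) @ (h @ self.w_k).T / np.sqrt(self.dim)
        x = x + softmax(scores) @ (h @ self.w_v) @ self.w_o
        # Split the sequence again and route each part through its own FFN.
        out, start = {}, 0
        for n in names:
            stop = start + len(segments[n])
            seg = x[start:stop]
            w1, w2 = self.ffn[n]
            out[n] = seg + np.maximum(layer_norm(seg) @ w1, 0.0) @ w2
            start = stop
        return out

# Usage: start with two modalities, then add a third without touching shared weights.
rng = np.random.default_rng(1)
block = SharedAttentionBlock(dim=8, modalities=["vision", "language"])
out = block({"vision": rng.normal(size=(5, 8)), "language": rng.normal(size=(3, 8))})
shared_before = block.w_q.copy()
block.add_modality("audio")
out_audio = block({"audio": rng.normal(size=(4, 8)), "language": rng.normal(size=(3, 8))})
```

The point of the sketch is the extension path: `add_modality` creates new FFN weights only, so the attention parameters that perform cross-modal fusion stay the same when a modality is added.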

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

OneLLM: One Framework to Align All Modalities with Language

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

ViT-Lens: Towards Omni-modal Representations

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Explore the Limits of Omni-modal Pretraining at Scale

MIO: A Foundation Model on Multimodal Tokens

On-the-fly Modulation for Balanced Multimodal Learning

Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

Meta-Transformer: A Unified Framework for Multimodal Learning

MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities

UNIMO: Towards Unified-Modal Understanding and Generation Via Cross-Modal Contrastive Learning

OmniBench: Towards The Future of Universal Omni-Language Models

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

One for All: Toward Unified Foundation Models for Earth Vision

High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE